Task 2: Alignments

Sequence alignments are a good start to analyse a new protein sequence. Alignment search tools such as Blast, Psiblast or hhblits can be used to find related protein sequences.

Lab journal task 2

Sequence Searches

The HFE protein sequence was used to conduct sequence searches with the three different search tools Blast, Psiblast and hhblits. CATH and COPS, two protein structure classification databases, were used to evaluate and compare the results of the different tools.

CATH is a database that hierarchically classifies all domain structures from PDB. This hierarchy consists of four major levels: 1. Class 2. Architecture 3. Topology 4. Homologous superfamily

We compared the first level of the CATH hierarchy (Class) to check whether the hits have the same fold class as the query. The class of a domain is determined by its secondary structure composition (mostly alpha, mostly beta, mixed alpha/beta or few secondary structures). Each domain of a protein is assigned to a CATH class, this means, that one query protein can consist of several fold classes, one for each domain. The HFE protein consists of two domains that belong to the CATH fold class 2 and 3.

Another database for protein structure clustering is COPS. It consists of different structural groups that contain all proteins with a structural similarity above a threshold. For example, all proteins in the L30 group have a structural similarity of at least 30% percent. This means that if two proteins are not in the same L30 group, they have less than 30% structural similarity. The L30 groups of the hits were determined to check if they are in the same group as the HFE protein or not.

Blast

The Blast search in the big_80 database yielded 1504 hits.

**Figure 1:** E-value and identity distribution of the Blast search hits.
a) E-value distribution of the results of the Blast search against big_80 using standard parameters. The x-axis shows the natural logarithm of the e-value and the y-axis the frequency. The better the e-value the smaller the log(e-value).	b) Identity distribution of the results of the Blast search against big_80 using standard parameters. The x-axis shows the relative identity between the query and the hits.

</figure>

<xr id="blast distribution"/> shows the e-value and identity distributions of all unfiltered Blast search hits. There are only a few hits with a very low e-value of 1e-160, but most significant hits have an e-value between 1e-40 and 1e-50 (left small peak in a)). Most results have a worse e-value above 1e-20. The histogram of the sequence identity b) shows that nearly all hits have an identity between 20% to 40% to the query.

The Blast results only contain 13 PDB IDs. Nevertheless, the number of shared CATH classes between the reference sequence HFE and the hits were computed as described in the Lab journal task 2.

<css> table.colBasic2 { margin-left: auto; margin-right: auto; border: 2px solid black; border-collapse:collapse; width: 40%; } .colBasic2 th,td { padding: 3px; border: 2px solid black; } .colBasic2 td { text-align:left; } .colBasic2 tr th { background-color:#efefef; color: black;} .colBasic2 tr:first-child th { background-color:#adceff; color:black;} </css>

**Table 1:** Histogram of the number of shared CATH classes between HFE and the Blast hits.
number of same fold classes	frequency
0	4
1	6
2	3

</figtable>

The <xr id="blast cath"/> shows, that only 3 of the 13 pdb hist share all domain fold classes with HFE, 6 of the hits have one domain fold class in commom with the reference and 4 even none. Besides, only 6 of the 13 pdb hits are in the same COPS L30 group as the HFE protein. Those numberst indicate, that some of the Blast hits are not really related to HFE and thus false positives.

Psiblast

Four different Psiblast runs against big_80 were executed using the following four parameter combinations:

2 iterations and default e-value cutoff 0.002 (j=2, h=2)
2 iterations and e-value cutoff 10E-10 (j=2, h=10E-10)
10 iterations and default e-value cutoff 0.002 (j=10, h=2)
10 iterations and e-value cutoff 10E-10 (j=10, h=10E-10)

The e-value and sequence identity distributions of the hits from the four runs were compared to identify the best parameter combination.

**Figure 2:** E-value and percent sequence identity distribution of the results from the four Psiblast runs against big_80.
a) Boxplot of the log(e-value) of the four different Psiblast searches against big_80: 2 iterations with default e-value cutoff (0.002) and 10E-10 and 10 iterations with default e-value cutoff (0.002) and 10E-10. The thick black line inside the box is the median and the upper and lower end of the box the 25% and 75% quantils. The single points above the whiskers are outliers.	b) Boxplot of the identity of the results of the four different Psiblast search against big_80: 2 iterations with default e-value cutoff (0.002) and 10E-10 and 10 iterations with default e-value cutoff (0.002) and 10E-10. The thick black line inside the box is the median and the upper and lower end of the box the 25% and 75% quantils. The single points above and below the whiskers are outliers.

</figure>

**Table 2:** Comparison of the results of the four Psiblast searches against big_80.
		Psiblast parameters
		j=2, h=0.002	j=2, h=10E-10	j=10, h=0.002	j=10, h=10E-10
# hits		2564	3163	2601	3343
# PDB hits		66	72	56	66
CATH classes in common with HFE	0	21	21	20	21
	1	42	48	33	42
	2	3	3	3	3
# hits within same L30 group		6	6	6	6

</figtable>

<xr id="psiblast summary"/> summarises the results of the different Psiblst runs. The run with default parameter settings (j=2 and h=0.002) found the least hits. Nevertheless, the boxplot of the e-value distribution (<xr id="psiblast boxplot"/> a)) shows, that the default parameters can be considered the best: the median of the log(e-value) is smaller higher than the median of the other 3 psiblast runs. This indicates that the percentage of hits with a low and thus significant e-value is higher in the default psiblast run than in the other three. Besides, the default parameters also led to a higher percent sequence identity as indicated by the median of the cyan box in <xr id="psiblast boxplot"/> b).

The number of PDB hits varies a bit among the different runs, but all found 3 proteins with the same two CATH classes as the query. There are 1095 proteins in the same COPS L30 group as the query HFE. Since the big_80 datatbase does not contain all proteins from pdb, the quality of the program cannot be assesed using the COPS L30 group of HFE. Nevertheless, of the found pdb hits only 6 proteins that belong to the same L30 group.

PDB search

To get more pdb ids, we also did 4 psiblast searches against the pdb seqres database using the profiles created with big_80. Those four result lists were compared as well.

**Figure 3:** Evalue and sequence identity distributions of the hits from the four Psiblast runs against pdb_seqres.
b) Boxplot of the log(e-value) of the four different Psiblast searches against pdb_seqres using the checkpoint files from the runs before. (2 iterations with default e-value cutoff (0.002) and 10E-10 and 10 iterations with default e-value cutoff(0.002) and 10E-10). The thick black line inside the box is the median and the upper and lower end of the box the 25% and 75% quantils. The single points below the whiskers are outliers.	Boxplot of the identity of the results of the four different Psiblast searches against pdb_seqres using the checkpoint files from the runs before. (2 iterations with default e-value cutoff (0.002) and 10E-10 and 10 iterations with default e-value cutoff(0.002) and 10E-10). The thick black line inside the box is the median and the upper and lower end of the box the 25% and 75% quantils. The single points above and below the whiskers are outliers.

</figure>

**Table 3:** Comparison of the results of the four Psiblast searches against pdb_seqres.
		Psiblast parameters
		j=2, h=0.002	j=2, h=10E-10	j=10, h=0.002	j=10, h=10E-10
# hits		4307	4814	3874	4058
CATH classes in common with HFE	0	561	586	529	539
	1	2798	3280	2397	2571
	2	948	948	948	948
# hits within same L30 group		1047	1053	1045	1052

</figtable>

The e-value and sequence identity distributions in <xr id="psiblast pdb"/> do not show a significant difference in the results, because the medians of all four e-value distributions are nearly the same. The last run with the parameter combination of j=10 and h=10E-10 has a higher percentage of low e-value hits, because the 25% quantil is below those of the other distributions. But the run with j=2 and h=10E-10 lead to more hits with a higher sequence identity, because the median slightly lies above the other medians. Apart from the query sequence, there were no pdb hits with a sequence identity with more than 40% found.

Evaluating the CATH and SCOP annotations of the found hits revealed that all four searches found 948 hits with the same 2 CATH classes as the reference. Besides, all runs could find nearly all 1095 proteins that are in the same COPS L30 group as HFE. Therefore,the decision which of the parameter settings leads to the best results cannot be based on the pdb hits, since the differences are minimal. Nevertheless, j=10, h=0.002 had the least hits, but the highest percentage of hits in the same L30 group as the query. It is thus best suited to find a high number of significant hits.

**Figure 4:** Venn diagram of the overlap between all hits from the four different searches. J1H1: 10 iterations and e-value cutoff 10E-10, J1H2: 10 iterations and e-value cutoff 2E-3, J2H1: 2 iterations and e-value cutoff 10E-10, J2H2: 2 iterations and e-value cutoff 2E-3.

</figure>

We also computed the hits overlap of the four different Psiblast searches in big_80 to see how comparable the found hits are. The overlap is illustrated in the Venn diagram in <xr id="psiblast venn"/>.

Additionally, we computed the overlap for the 100 best hits (based on e-value) of each of the runs. They were exactly the same in all four runs. Thus, the parameter settings only matter if a high number of hits is needed, but not if onyl the best hits of a search are used in further analyses.

HHblits

We executed hhblits with default parameters and searched the uniprot20_02Sep11 database. HHblits is an alignment search tool that searches in a database of clustered sequences. For each cluster-representative that matches the query, not only the hit itself, but all other cluster-members are reported in the results list as well. Thus, hhblits finds much more sequences than Blast and Psiblast. We found 2861 clusters and 37908 sequences in total. We used all sequences for the analysis, because one of hhblits most important features is the high number of good results.

**Figure 5:** E-value and identity distribution of the hhblits search hits.
a) E-value distribution of the results of the hhblits search with default parameters using the using the uniprot20_02Sep11 database. The x-axis shows the natural logarithm of the e-value and the y-axis the frequency. The better the e-value the smaller the log10(e-value).	b) Identity distribution of the results of the HHblits search in the uniprot20_02Sep11 database using default parameters. The x-axis shows the relative identity between the query and the hits.

</figure>

<xr id ="hhblits distribution"/> a) shows the distribution of the log10 of the e-values of the hhblits hits. It is remarkable that HHblits finds so many hits with a very low e-value of 1E-130. One reason is, that the best hit belongs to a very large cluster. Nevertheless, in comparison to the Blast e-value distribution ( <xr id="blast distribution"/> a)) is the percentage of hits with a high e-value much lower in the hhblits results.

There are several hits with a very low sequence identity to the query (see <xr id="hhblits distribution"/> b)), but the distribution contains two peaks around 20% and 40% identity that distant and more closely related hits.

Comparison

**Figure 6:** E-value and identity distribution of the Blast, Psiblast and hhblits search hits.
a) E-value distribution of the Blast, Psiblast and hhblits hits with default parameter searches.	b) Identity distribution of the Blast, Psiblast and hhblits hits default parameters searches

</figure>

We used the results of the Psiblast run with 2 iterations and an e-value cutoff of 2e-3, because we also used default parameters for Blast and hhblits and because we also received the best results with this parameters. We compared the e-value and identity distributions. <xr id ="comparison"/> a) clearly shows that Blast has the lowest e-values, but the percentage of low e-value hits is higher in the hhblits results. The median sequence identity has the highest value among the Blast hits.

**Figure 7:** Venn diagram of the overlap between all hits from the Blast, Psiblast and hhblits runs with default parameters.

</figure>

The overlap between all hits and the top 100 hits are shown in <xr id ="comparison venn"/>.

**Table 4:** Rank of the top 6 PSiblast hits in the Blast and hhblits results.
search tool	rank of hit
Psiblast	1	2	3	4	5	6
Blast	1	3	2	52	109	81
hhblits	2	not found	not found	not found	not found	not found

</figtable>

We tried to identify the 6 best Psiblast hits with the lowest e-values in the results of the Blast and hhblits searches, see <xr id ="best hits"/>- The top 6 Psiblast hits are all present in the Blast results and the first 3 hits are the same. This confirms that Blast and Psiblast are very similar. Hhblits, on the contrary, has definitely not the same results as Blast and Psiblast. From the top 6 Psiblast hits, only the first on is present in the second best cluster. This proves, that hhblis finds more distant related hits, which is also indicated by the low median sequence identity of the hhblits hits.

The best 5 hits from all three search tools belong to the MHC class I family and are thus correct hits.

Multiple sequence alignments

The following data sets were chosen for the Multiple Sequence Alignments:

Below 30% SeqID:

AC Number	Sequence Identity
F6JYA9	0.29
D9J389	0.28
D5MSB3	0.27
B3FRK2	0.20
3ov6_A	0.22
1p7k_L	0.21
H0Y1D0	0.20
B3FRK3	0.18
Q8HWL2	0.17

Above 60% SeqID:

G5BQE5 0.67

AC Number	Sequence Identity
B4DDZ1	0.83
Q5EEZ1	0.76
G3THV5	0.75
G1MBW1	0.73
H0VAR7	0.72
F1PX48	0.71
G1T7D7	0.70
O35799	0.68

Whole range of SeqID:

AC Number	Sequence Identity
B4DDZ1	0.83
Q5EEZ1	0.76
G1MBW1	0.73
H0VAR7	0.72
F1PX48	0.71
G1T7D7	0.70
G5BQE5	0.67
Q95IT9	0.38
2qrt-A	0.38
Q8HX83	0.36
F6WCX4	0.34
F6JYA9	0.29
D9J389	0.28
D5MSB3	0.27
2zok_E	0.24
H0Y1D0	0.20
Q8SNJ4	0.22
B3FRK3	0.18
Q8HWL2	0.17

pdb: 1p7k_L 0.21
3ov6_A 0.22

2qrt_A 0.38
2zok_E 0.24

ClustalW

Below 30% SeqID: ClustalW_Below_30
Above 60% SeqID: ClustalW_Above_60
Whole range of SeqID: ClustalW_whole

Task 2: Alignments

Contents

Sequence Searches

Blast

Psiblast

PDB search

HHblits

Comparison

Multiple sequence alignments

ClustalW

MAFFT

T-Coffee

Comparison

Navigation menu

Views

Personal tools

Bioinformatik navigation

MediaWiki navigation

Search

Tools