METHODS


The procedure for constructing this database consists of five major steps:

Domain selection

For this database we started with the 9498 domains from SCOP 1.65 with 95% or less sequence identity. 2360 these domains were further divided into 5736 subdomains using a novel procedure by Michael Levitt (to be published) resulting in a total of 12874 domains constituting our data set.

Structural Alignments

Using the method STRUCTAL (Subbiah et al., Current Biol. 3, 141 (1993)) we calculated structural alignments for all 165,727,002 pairs of domains. STRUCTAL utilizes iterative dynamic programming to calculate an alignment. Starting from a random orientation of the two structures an alignment is calculated based on a similarity matrix which only depends on the inter-atomic distances. Using the equivalences implied by this alignment the structures are least-square fitted onto each other. The procedure is repeated until convergence.

As an initial filter only the 1,051,284 alignments with a log P -value less than -2.3 were kept, i.e. those reported 'ok' by STRUCTAL. However, after this pre-filtering GSAS (Gapped Structural Alignment Score) was used to judge the significance of the alignments. GSAS is defined as 100*cRMS/(N_mat-N_break), where N_mat is the number of aligned residues and N_break is the number of breaks in the alignment. In a recent study comparing several different structural alignment quality measures GSAS was found to perform best in separating good alignments from less good ones (Michael Levitt, 2004, personal communication), with a lower GSAS indicating a more significant alignment.

Clustering

For the clustering, we defined a network were each domain is a node and edges represents alignments whose GSAS is below a threshold. The distance between any two nodes were taken as the number of edges separating them. The clustering was performed using Hierarchical clustering (Johnson, S.C. 1967, Psychometrika 2:241-254). In this method one starts out with as many clusters as datapoints. The clustering proceeds by repeatedly merging the two 'closest' clusters until all the distances between clusters are larger than some cutoff. Our cutoff was set to two links in order to ensure our clusters consists of members which are highly similar to each other.

To decide which GSAS cut-off to use we performed the clustering for several different values and looked for the value which maximizes the number of clusters with 3 or more members. For this we also looked at which overlap cut-off to use, where overlap is defined as the number of aligned residues divided by the length of the shortest protein.
GSASOverlap (%)N_3
1.080677
1.085655
1.090627
1.095598
1.580920
1.585868
1.590826
1.595775
2.0801060
2.0851002
2.090976
2.095901
2.5801134
2.5851090
2.5901059
2.595962
3.0801164
3.0851137
3.0901095
3.0951005

Hence we decided to only use alignments with GSAS<3.0 and Overlap > 80%.

Multiple Structural Alignments

For each cluster containing 3 or more members we used the pairwise alignments to calculate a Multiple Structural Alignment (MSTA). This was done using a method called RESOLVER which implements the integer linear programming formulation of the maximum weight trace problem by Reinert et al. (Reinert, K., Lenhof, H.-P., Mutzel, P., Mehlhorn, K. and Kececioglu, J. 1997. In Proceedings of the First Annual International Conference on Computational Molecular Biology (RECOMB-97). ACM Press, New York, pp 241-249.). The main idea behind RESOLVER is that all the information needed for the MSTA is contained in the pairwise alignments. However, these alignments, in general, contains ordering conflicts which have to be resolved in order to construct a consistent MSTA. RESOLVER views this as an optimization problem where one wants to remove as few residue equivalences as possible such that the ordering conflicts disappear. For some clusters RESOLVER failed to produce a MSTA as these clusters contained too many members. These clusters were subdivided by using a more stringent GSAS-cutoff and reclustering these clusters. The GSAS cutoff was lowered until the resulting subclusters had a size which could be handled by RESOLVER. In the datafiles these subclusters are denoted X.X, i.e. the subclusters of cluster 5 are called 5.0, 5.1, .... In the alignments upper case letters indicates aligned residues and lower case unaligned. This notation is introduced for cosmetic reasons: if we would show the gaps explicitly the alignments would be very long. The secondary structure assignments were obtained using STRIDE. '*' indicates positions were STRIDE fail to assign a secondary structure. Furthermore, some domains do not have any secondaru structure assignments at all. These are domains were the pdb-file do not contain eniugh information for STRIDE to make the assignments.

Interactive Map of Protein Structure Space

The layout for the interactive map was constructed using the LGL (Large Graph Layout) software. LGL was fed with the set of edges corresponding to all alignments with GSAS<5.0 and the GSAS-values of the alignments were used as weights when generating the layout.