METHODS
The procedure for constructing this database consists of five major steps:
Domain selection
For this database we started with the 9498 domains from SCOP
1.65 with 95% or less sequence identity. 2360 these domains were further divided
into 5736 subdomains using a novel procedure by Michael Levitt (to be published) resulting
in a total of 12874 domains constituting our data set.
Structural Alignments
Using the method STRUCTAL (Subbiah et al., Current Biol. 3, 141 (1993))
we calculated structural alignments for all 165,727,002 pairs of domains. STRUCTAL utilizes iterative dynamic
programming to calculate an alignment. Starting from a random orientation of the two structures
an alignment is calculated based on a similarity matrix which only depends on the inter-atomic distances.
Using the equivalences implied by this alignment the structures are least-square fitted onto each other. The procedure
is repeated until convergence.
As an initial filter only the 1,051,284 alignments with a log P -value less than -2.3 were kept, i.e. those reported 'ok' by STRUCTAL.
However, after this pre-filtering GSAS (Gapped Structural Alignment Score) was used to judge the significance of the alignments.
GSAS is defined as 100*cRMS/(N_mat-N_break), where N_mat is the number of aligned residues and N_break is the number of breaks in the alignment.
In a recent study comparing several different structural alignment quality measures
GSAS was found to perform best in separating good alignments from less good ones (Michael Levitt, 2004, personal communication), with a lower GSAS indicating a more significant alignment.
Clustering
For the clustering, we defined a network were each domain is a node and edges represents alignments whose GSAS is below a threshold.
The distance between any two nodes were taken as the number of edges separating them.
The clustering was performed using Hierarchical clustering (Johnson, S.C. 1967, Psychometrika 2:241-254). In this method one starts
out with as many clusters as datapoints. The clustering proceeds by repeatedly merging the two 'closest' clusters until all the
distances between clusters are larger than some cutoff. Our cutoff was set to two links in order to ensure our clusters consists
of members which are highly similar to each other.
To decide which GSAS cut-off to use we performed the clustering for several different
values and looked for the value which maximizes the number of clusters with 3 or more members.
For this we also looked at which overlap cut-off to use, where overlap is defined as the number of aligned residues divided by the length of the shortest protein.
| GSAS | Overlap (%) | N_3 |
| 1.0 | 80 | 677 |
| 1.0 | 85 | 655 |
| 1.0 | 90 | 627 |
| 1.0 | 95 | 598 |
| 1.5 | 80 | 920 |
| 1.5 | 85 | 868 |
| 1.5 | 90 | 826 |
| 1.5 | 95 | 775 |
| 2.0 | 80 | 1060 |
| 2.0 | 85 | 1002 |
| 2.0 | 90 | 976 |
| 2.0 | 95 | 901 |
| 2.5 | 80 | 1134 |
| 2.5 | 85 | 1090 |
| 2.5 | 90 | 1059 |
| 2.5 | 95 | 962 |
| 3.0 | 80 | 1164 |
| 3.0 | 85 | 1137 |
| 3.0 | 90 | 1095 |
| 3.0 | 95 | 1005 |
Hence we decided to only use alignments with GSAS<3.0 and Overlap > 80%.
Multiple Structural Alignments
For each cluster containing 3 or more members we used the pairwise alignments to calculate a Multiple Structural Alignment (MSTA).
This was done using a method called RESOLVER which implements the integer linear programming
formulation of the maximum weight trace problem by Reinert et al.
(Reinert, K., Lenhof, H.-P., Mutzel, P., Mehlhorn, K. and Kececioglu, J. 1997. In Proceedings of the First
Annual International Conference on Computational Molecular Biology (RECOMB-97). ACM Press, New York, pp 241-249.). The main idea behind RESOLVER is that all the information
needed for the MSTA is contained in the pairwise alignments. However, these alignments, in general, contains ordering conflicts which have to
be resolved in order to construct a consistent MSTA. RESOLVER views this as an optimization problem where one wants to remove as few residue equivalences as possible
such that the ordering conflicts disappear. For some clusters RESOLVER failed to produce a MSTA as these clusters contained too many members. These
clusters were subdivided by using a more stringent GSAS-cutoff and reclustering these clusters. The GSAS cutoff was lowered until the resulting subclusters
had a size which could be handled by RESOLVER. In the datafiles these subclusters are denoted X.X, i.e. the subclusters of cluster 5 are called 5.0, 5.1, ....
In the alignments upper case letters indicates aligned residues and lower case unaligned. This notation is introduced
for cosmetic reasons: if we would show the gaps explicitly the alignments would be very long. The secondary structure
assignments were obtained using STRIDE. '*' indicates positions were STRIDE fail to assign a secondary structure. Furthermore, some domains do not have any secondaru structure assignments at all. These are domains were the pdb-file do not contain eniugh information for STRIDE to make the assignments.
Interactive Map of Protein Structure Space
The layout for the interactive map was constructed using the LGL (Large Graph Layout) software.
LGL was fed with the set of edges corresponding to all alignments with GSAS<5.0 and the GSAS-values of the alignments were used as weights when generating the layout.