Using a rather limited dataset, we obtained consistent improvements over a baseline set by the Smith Waterman method in the number of correctly identified top-ranking folds. Comparisons with other methods are difficult to make, except in a fully blind trial such as CASP. Rost et al.[Rost et al., 1997] recently published work which compared all combinations of alignments of sequence, predicted secondary structure and observed secondary structure, using log-odds matrices. Their baseline, using the Smith Waterman algorithm with a matrix from McLachlan[McLachlan et al., 1984], produced 16% correct first hits from 89 queries and a library of 723. Using a combination of sequence and PHD secondary structure predictions for both query and library sequences, they obtained 27% correct first hits. Introducing known secondary structure to the library sequence information improved the performance by only a few percent. Our baseline was higher, at 26%, and was increased to 48% using a combination of hydrophobicity and secondary structure prediction probabilities. Rost et al. showed, however, that the percentage of correct first hits was around 50% when using only query folds which structurally aligned with library sequences with more than 70% overlap. The detection of partial matches is a much harder problem, as was also found at the CASP2 meeting[Marchler-Bauer & Bryant, 1997]. In this respect, our query set, which uses domains (not whole chains) of between 100 and 300 residues, is `easy' (but see below), which probably explains the agreement of our results with those of Rost et al. Our results were obtained without structural information, however, and with a different set of proteins.
How do the results of sequence-only profile methods (including hidden Markov models and position specific scoring methods) compare to our results and fold recognition methods as a whole? As discussed in Section 4.1, the benchmarks for sequence methods are generally quite different to those for fold recognition methods. Blind (or at least coordinated) testing on the same data is the only fair comparison. At CASP2 there were unfortunately too few targets to properly compare accuracies between methods. Most methods did better on easier targets: those with extensive overlap with known folds, and/or slightly more related sequences (judged by sequence identity) or common sequence motifs related to function. Sequence-only hidden Markov methods also did well on these, but not so well on the harder targets (S. Bryant, personal communication; full evaluation to be published in Proteins: Structure, Function and Genetics). Many sequence-only `profilers' did not take part in the experiment.
The alignment of sequence property vectors described in this work is quite similar to profile methods in two ways. Firstly, evolutionary information is incorporated from multiple sequence alignments and secondary structure predictions. Secondly, amino acid substitution matrices (used in profile and non-profile methods) indirectly encode much of the information that we have used, hydrophobicity in particular. The method does not employ position-specific gap penalties, however (although they could be added quite easily). Considering the simplicity of the method, why does it seem to perform so well on this small dataset? The small number of queries (27) clearly has some bearing on the results. It is widely accepted that results on small datasets, regardless of how unbiased they are, tend to overestimate the accuracy that can be expected in blind trials. Furthermore, our dataset is biased in favour of success. Domains were selected for having 10 or more non-redundant (to 70% pairwise identity) multiple sequences. The optimised evolutionary information content of the dataset will inevitably improve the sensitivity of our sequence searches. Of course, we can state that our method has % accuracy if there are at least suitable multiple sequences for the query sequence, and if a similar fold exists in the library (null predictions will be discussed below).
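The selection criterion mentioned above (a minimum number of multiple sequences that are non-redundant to 70% pairwise identity) can be illustrated with a minimal sketch. The greedy filtering order, the identity definition (matches over columns where both sequences have a residue), the inclusive threshold and all function names are assumptions made for illustration; the text does not specify how redundancy was actually removed.

```python
def pairwise_identity(a, b):
    """Fraction of identical residues over columns where both aligned
    sequences have a residue (assumed definition; '-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns containing a gap
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def non_redundant(aligned, max_identity=0.70):
    """Greedy filter (an assumption): keep a sequence only if it is no more
    than max_identity identical to every sequence already kept."""
    kept = []
    for seq in aligned:
        if all(pairwise_identity(seq, k) <= max_identity for k in kept):
            kept.append(seq)
    return kept


def has_enough_sequences(aligned, minimum=10, max_identity=0.70):
    """True if the alignment retains at least `minimum` non-redundant sequences."""
    return len(non_redundant(aligned, max_identity)) >= minimum


# Toy alignment (hypothetical sequences, not from the dataset).
toy_alignment = [
    "ACDEFGHIKL",
    "ACDEFGHIKV",  # 90% identical to the first sequence: filtered out
    "MNPQRSTVWY",  # no identical positions: kept
    "ACDE-GHIKL",  # identical to the first over all ungapped columns: filtered out
]
filtered = non_redundant(toy_alignment)
```

A greedy filter depends on the order of the input sequences, so a different ordering could retain a different (but equally non-redundant) subset.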
It was simple enough to apply SIVA (using a 1:1 combination of hydrophobicity and two-state DSC predictions) to a larger set of query and library folds. Now allowing domains of 100-300 residues with at least 5 multiple sequences, 78 queries and 197 library domains were available. This set now contains many more domains with lower quality evolutionary information. Furthermore, with a larger library, more false hits are expected to occur by chance. With our method, 45% of top hits are correct, compared to 18% for the Smith Waterman control; thus roughly the same increase in performance is observed with the larger trial. It should be stressed that the figure of 45% is not an estimate of the accuracy of distant homologue detection, since the dataset contained a number of easily detectable homologues (14, using Smith Waterman).
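The "correct top hits" measure used throughout this comparison can be sketched as follows. The data structures, fold labels and identifiers are hypothetical and serve only to show how the count is formed; they are not taken from the benchmark.

```python
def top_hit_accuracy(ranked_hits, fold_of):
    """Count queries whose top-ranked library entry shares the query's fold.

    ranked_hits: dict mapping query id -> list of library ids, best first.
    fold_of:     dict mapping every id (query or library) -> fold label.
    Returns (number of correct top hits, number of queries).
    """
    correct = 0
    for query, hits in ranked_hits.items():
        if hits and fold_of[hits[0]] == fold_of[query]:
            correct += 1
    return correct, len(ranked_hits)


# Hypothetical toy benchmark: three queries against a three-member library.
fold_of = {
    "q1": "globin", "q2": "tim_barrel", "q3": "rossmann",
    "libA": "globin", "libB": "tim_barrel", "libC": "ferredoxin",
}
ranked_hits = {
    "q1": ["libA", "libB", "libC"],  # correct fold ranked first
    "q2": ["libC", "libB", "libA"],  # correct fold present but ranked second
    "q3": ["libB", "libA", "libC"],  # no matching fold in the library at all
}
correct, total = top_hit_accuracy(ranked_hits, fold_of)
print(f"{correct}/{total} correct top hits ({100 * correct / total:.0f}%)")
```

In this sketch a query with no matching fold in the library (`q3`) simply counts as incorrect; how such cases are treated changes the reported percentage, which is one reason figures from different benchmarks are hard to compare.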
Log-odds and position-specific matrix methods require the discretisation of sequence data. Amino acid sequences are inherently discretised into 20 classes, but predictions of secondary structure or accessibility require discretisation[Fischer & Eisenberg, 1996, Rice & Eisenberg, 1997, Rost et al., 1997], leading to the loss of information. Even position-specific scoring matrices can be rather `lumpy' with sparse data (few multiple alignments). From our results, the direct comparison (using the Euclidean distance) of mean hydrophobicities and secondary structure prediction probabilities appears to be very effective. It is not necessary to decide where to set the boundaries for different classes of hydrophobicity or prediction probability, hence none of this information is lost (although the hydrophobicity scales have by definition 20 discrete values). We suggest that the magnitude of sequence-derived information may be as important as the patterns of discrete states, which have been the basis for most methods in sequence comparison and structure prediction.
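The contrast between comparing magnitudes directly and comparing discretised states can be made concrete with a small sketch. The property values, the 0.5 cut-off and the function names below are assumptions chosen for illustration; they are not the scales or thresholds used in this work.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length property vectors."""
    if len(u) != len(v):
        raise ValueError("property vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def discretise(probability, threshold=0.5):
    """Collapse a prediction probability into one of two states
    (the threshold is a hypothetical class boundary)."""
    return 1 if probability >= threshold else 0


# Hypothetical positions: (mean hydrophobicity, helix prediction probability).
pos_a = (0.62, 0.51)
pos_b = (0.60, 0.95)
pos_c = (0.61, 0.49)

# Continuous comparison: a is far closer to c than to b.
d_ab = euclidean_distance(pos_a, pos_b)
d_ac = euclidean_distance(pos_a, pos_c)

# Discretised comparison: a and b fall in the same class, a and c do not.
same_state_ab = discretise(pos_a[1]) == discretise(pos_b[1])
same_state_ac = discretise(pos_a[1]) == discretise(pos_c[1])
```

Here the two-state view pairs `pos_a` with `pos_b` because both lie above the boundary, while the Euclidean distance pairs `pos_a` with `pos_c`, whose prediction probabilities differ by only 0.02. This is the kind of information that is lost when a class boundary has to be chosen.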