Supplementary information for BMC Bioinformatics 2006, 7:16
Automatic discovery of cross-family sequence features associated with protein function
Markus Brameier, Josien Haan, Andrea Krings and Robert M. MacCallum
Stockholm Bioinformatics Center, Stockholm University, 106-91 Stockholm, Sweden
Supplementary Figure A
Available in HTML and PDF formats. Please note that the PDF document
may have to be rescaled for printing.
As in Figure 2 of the article, the predictors are clustered with an 8 x
8 Kohonen Self-Organising Map (SOM). In this figure, the evolved
annotation-word boolean expressions (from the
annotation_classifier subroutine) are shown in full for each
of the 500 evolved function predictors. Boolean expressions are
separated by semicolons. The A-type predictors are shown in upper
case to distinguish them.
Supplementary Table A
Available in HTML format.
Here we show the full list of the 150 most common annotation words
after manual filtering. The filtering removes
stopwords and words that carry no information about protein
function. The filtered words are shown with
Matthews Correlation Coefficient
CC = (tp*tn - fn*fp) / sqrt( (tn+fn)(tn+fp)(tp+fn)(tp+fp) )
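As a quick check of the formula above, here is a minimal Python sketch. It assumes that tp, tn, fp and fn are the counts of true positives, true negatives, false positives and false negatives. The function name matthews_cc and the choice to return 0.0 when the denominator is zero are illustrative assumptions, not taken from the article.

```python
import math


def matthews_cc(tp: int, tn: int, fp: int, fn: int) -> float:
    """CC = (tp*tn - fn*fp) / sqrt((tn+fn)(tn+fp)(tp+fn)(tp+fp))."""
    denominator = math.sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
    if denominator == 0:
        # Assumption: report 0.0 when any marginal total is zero,
        # because the coefficient is undefined in that case.
        return 0.0
    return (tp * tn - fn * fp) / denominator


# Illustrative counts only (not results from the article).
print(matthews_cc(tp=40, tn=45, fp=5, fn=10))
```

A perfect predictor (fp = fn = 0) gives CC = 1, and a predictor that is always wrong (tp = tn = 0) gives CC = -1.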
Training and testing data
Where four-fold cross-validation was performed, each of the following
"cuts" of the data was used as a test set, while the other three were
used as training data. In all other cases, the "training set" is the
concatenation of cuts 1 to 3 and the "test set" is cut 4.
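The two splitting schemes described above can be sketched in Python as follows. This is only an illustration of the bookkeeping: the labels "cut1" to "cut4" and both function names are hypothetical and stand in for the four datafiles.

```python
def four_fold_splits(cuts):
    """Yield (training_cuts, test_cut) pairs, holding out each cut once."""
    for held_out in range(len(cuts)):
        training = [cut for i, cut in enumerate(cuts) if i != held_out]
        yield training, cuts[held_out]


def default_split(cuts):
    """Return the fixed split: cuts 1 to 3 for training, cut 4 for testing."""
    return cuts[:3], cuts[3]


# Hypothetical labels standing in for the four datafiles.
cut_names = ["cut1", "cut2", "cut3", "cut4"]

for training, test in four_fold_splits(cut_names):
    print("train on", training, "and test on", test)

print("default split:", default_split(cut_names))
```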
Datafile description: column 1 contains the UniProt/Swiss-Prot ID, column 2 contains the UniProt/Swiss-Prot AC, column 3 contains the amino acid sequence, columns 4 and onwards contain the annotation words.
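A minimal Python sketch for reading one line of such a datafile is given below. It assumes that columns are separated by whitespace, which the description above does not state, so adjust the split if the files use another delimiter. The Record type, the parse_record name and the example line are invented for illustration and are not real Swiss-Prot data.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    swissprot_id: str            # column 1: UniProt/Swiss-Prot ID
    accession: str               # column 2: UniProt/Swiss-Prot AC
    sequence: str                # column 3: amino acid sequence
    annotation_words: List[str]  # columns 4 onwards: annotation words


def parse_record(line: str) -> Record:
    """Split one datafile line into its columns."""
    # Assumption: whitespace-delimited columns.
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("expected at least ID, AC and sequence columns")
    return Record(fields[0], fields[1], fields[2], fields[3:])


# Made-up example line, for illustration only.
example = "EXAMPLE_HUMAN P00000 MKTAYIAKQR kinase binding"
print(parse_record(example))
```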
Running it yourself with PerlGP
Here are some guidelines if you would like to try the self-supervised learning yourself.
Please contact Bob MacCallum if you have any problems or questions.
- Install PerlGP and get to the
stage where you can run one or two of the demos and know how to look
at the results they produce. At present, support can only be given
for Unix-like operating systems (e.g. Linux, Mac OS X), but it may
work on others.
- Install PDL, if your Perl installation doesn't already have it.
- Create a new directory called "funcpred" (for example), download
this tar.gz file and unpack it so
that you create a subdirectory "funcpred/funcpred-demo".
- Create a new subdirectory "funcpred/data" and, in this directory,
concatenate three of the datafiles (cuts 1 to 4 above) into "training.dat" and rename
the remaining one to "testing.dat".
- Now you should be able to start the self-supervised learning with:
# change directory to the "experiment directory"
# quickly check that the data files are where they should be
ls ../data/training.dat ../data/testing.dat
# clean out any old data for this experiment
# this program shows you what a random program looks like
# (like those in the initial population)
# start the run so that it restarts if there are any unavoidable crashes
(ignore the warnings about divide by zero and log(x≤0))
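The data preparation step in the list above (three cuts concatenated into "training.dat", the fourth used as "testing.dat") could also be scripted. The Python sketch below is one way to do it, under two assumptions: each cut file ends with a newline, and the held-out cut is copied rather than renamed so that the originals stay intact. The function name prepare_data and its parameters are hypothetical.

```python
import shutil
from pathlib import Path


def prepare_data(cut_files, test_index, data_dir):
    """Write training.dat and testing.dat into data_dir.

    cut_files  -- paths to the four cut datafiles, in order
    test_index -- 0-based index of the cut to hold out for testing
    data_dir   -- output directory, e.g. "funcpred/data"
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)

    # Concatenate every cut except the held-out one.
    with open(data_dir / "training.dat", "wb") as out:
        for i, cut in enumerate(cut_files):
            if i != test_index:
                with open(cut, "rb") as src:
                    shutil.copyfileobj(src, out)

    # Copy (rather than rename) the held-out cut.
    shutil.copyfile(cut_files[test_index], data_dir / "testing.dat")
```

For the default split, call it with the four downloaded cut files in order, test_index=3 and data_dir="funcpred/data".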