Supplementary information for BMC Bioinformatics 2006, 7:16
Automatic discovery of cross-family sequence features associated with protein function

Markus Brameier, Josien Haan, Andrea Krings and Robert M. MacCallum
Stockholm Bioinformatics Center, Stockholm University, 106-91 Stockholm, Sweden

Supplementary Figure A

Available in HTML and PDF formats. Please note that the PDF document may have to be rescaled for printing.

Description
As in Figure 2 of the article, the predictors are clustered with an 8 x 8 Kohonen Self-Organising Map (SOM). In this figure, the evolved annotation word boolean expressions (from the annotation_classifier subroutine) are shown in full for each of the 500 evolved function predictors. The boolean expressions are separated by semicolons. The A-type predictors are shown in upper case to identify them.

Supplementary Table A

Available in HTML format.

Description
Here we show the full list of the 150 most common annotation words after manual filtering. The filtering is performed in order to remove stopwords and words that do not contain any information about protein function. The filtered words are shown with

Supplementary Material

Matthews Correlation Coefficient

CC = (tp*tn - fn*fp) / sqrt( (tn+fn)(tn+fp)(tp+fn)(tp+fp) )

where tp, tn, fp and fn are the numbers of true positives, true negatives, false positives and false negatives, respectively.

Training and testing data

Where four-fold cross-validation was performed, each of the following "cuts" of the data was used in turn as the test set, while the other three were used as training data. In all other cases, the "training set" is the concatenation of cuts 1 to 3 and the "test set" is cut 4.

Datafile description: column 1 contains the UniProt/Swiss-Prot ID, column 2 contains the UniProt/Swiss-Prot AC, column 3 contains the amino acid sequence, and columns 4 onwards contain the annotation words.

Running it yourself with PerlGP

Here are some guidelines if you would like to try the self-supervised learning yourself.
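
Before working with the data, it may help to see the definitions above in executable form. The following Python sketch is only an illustration and is not part of PerlGP or of the authors' code: it computes the Matthews correlation coefficient from the four counts, parses one datafile record into the columns described above, and enumerates the four cross-validation splits. The function names, the assumption that columns are separated by whitespace, and the choice to return 0 when the denominator is zero are all assumptions made for this example.

```python
import math
from typing import List, NamedTuple, Tuple


class Record(NamedTuple):
    """One datafile record, following the column description above."""
    swissprot_id: str            # column 1: UniProt/Swiss-Prot ID
    accession: str               # column 2: UniProt/Swiss-Prot AC
    sequence: str                # column 3: amino acid sequence
    annotation_words: List[str]  # columns 4 onwards


def matthews_cc(tp: int, tn: int, fp: int, fn: int) -> float:
    """CC = (tp*tn - fn*fp) / sqrt((tn+fn)(tn+fp)(tp+fn)(tp+fp))."""
    denom = (tn + fn) * (tn + fp) * (tp + fn) * (tp + fp)
    if denom == 0:
        # Assumption: treat the undefined case as "no correlation".
        return 0.0
    return (tp * tn - fn * fp) / math.sqrt(denom)


def parse_record(line: str) -> Record:
    """Split one line into ID, AC, sequence and annotation words.

    Assumption: columns are whitespace-separated; the real files
    may use a different delimiter.
    """
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("expected at least ID, AC and sequence columns")
    return Record(fields[0], fields[1], fields[2], fields[3:])


def cross_validation_splits(n_cuts: int = 4) -> List[Tuple[List[int], int]]:
    """Return (training cuts, test cut) pairs, one per cut."""
    cuts = list(range(1, n_cuts + 1))
    return [([c for c in cuts if c != test], test) for test in cuts]


if __name__ == "__main__":
    # Hypothetical record, invented purely for illustration.
    rec = parse_record("EXAMPLE_HUMAN P00000 MKTAYIAK kinase binding")
    print(rec.annotation_words)
    print(matthews_cc(tp=40, tn=45, fp=5, fn=10))
    for train, test in cross_validation_splits():
        print("train on cuts", train, "test on cut", test)
```

The last split produced by `cross_validation_splits` (training on cuts 1 to 3, testing on cut 4) corresponds to the fixed training and test sets used where cross-validation was not performed.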