Instructions
Our method is a further development of the TMHMM prediction method.
Our version also performs an ordinary topology prediction on one or several
membrane proteins, but in order to increase the prediction accuracy it can make
use of prior knowledge. The method allows the user to add experimental
(or hypothetical) information on where certain regions in the protein are
located. These regions are then held fixed during the prediction. In addition, the method calculates a reliability score that helps the
user estimate how trustworthy the prediction is.
The method is described in:
- Melén K, Krogh A and von Heijne G
Reliability measures for membrane protein topology prediction algorithms.
Journal of Molecular Biology, 327(3):735-744, March 2003.
Please cite.
Input
The input should be one or more protein sequences in FASTA format, either pasted into the
text box or uploaded from a file on the user's computer. If sequences are supplied both ways, only the pasted sequences are predicted. The program recognizes the 20 standard amino acids. B, Z, and X are all treated as unknown. Any other character is changed to X, so please make sure the input consists of sensible protein sequences.
Example of a protein in FASTA format:
>5H2A_CRIGR you can have comments after the ID
MEILCEDNTSLSSIPNSLMQVDGDSGLYRNDFNSRDANSSDASNWTIDGENRTNLSFEGYLPPTCLSILHL
QEKNWSALLTAVVIILTIAGNILVIMAVSLEKKLQNATNYFLMSLAIADMLLGFLVMPVSMLTILYGYRWP
LPSKLCAVWIYLDVLFSTASIMHLCAISLDRYVAIQNPIHHSRFNSRTKAFLKIIAVWTISVGVSMPIPVF
GLQDDSKVFKQGSCLLADDNFVLIGSFVAFFIPLTIMVITYFLTIKSLQKEATLCVSDLSTRAKLASFSFL
PQSSLSSEKLFQRSIHREPGSYTGRRTMQSISNEQKACKVLGIVFFLFVVMWCPFFITNIMAVICKESCNE
HVIGALLNVFVWIGYLSSAVNPLVYTLFNKTYRSAFSRYIQCQYKENRKPLQLILVNTIPALAYKSSQLQA
GQNKDSKEDAEPTDNDCSMVTLGKQQSEETCTDNINTVNEKVSCV
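To check an input file before submitting it, the character handling described above can be approximated locally. The following is a minimal sketch, not the server's own code: the function names `sanitize_sequence` and `parse_fasta` are hypothetical, B and Z are simply mapped to X (the server only states that they are treated as unknown), and upper-casing the input is an assumption.

```python
# Hypothetical helpers; they only mimic the documented rule that anything
# outside the 20 standard amino acids ends up as an unknown residue.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def sanitize_sequence(seq):
    """Return seq with whitespace removed and unrecognized characters as 'X'."""
    cleaned = []
    for ch in seq.upper():  # assumption: lower-case letters are accepted
        if ch.isspace():
            continue
        cleaned.append(ch if ch in STANDARD_AA else "X")
    return "".join(cleaned)


def parse_fasta(text):
    """Return a list of (identifier, sanitized sequence) tuples."""
    records = []
    seq_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, sanitize_sequence("".join(chunks))))
            # The ID is the first token; the rest of the line is a comment.
            parts = line[1:].split(None, 1)
            seq_id = parts[0] if parts else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        records.append((seq_id, sanitize_sequence("".join(chunks))))
    return records
```

A sequence that contains many X residues after this step is a sign that the input was not a clean protein sequence.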
Output options
Results with graphics
This is the default option.
The program gives some statistics and a list of the location of the predicted
transmembrane helices and the predicted location of the intervening loop
regions.
Here is an example:
# COX2_BACSU Length: 356
# COX2_BACSU Number of predicted TMHs: 3
# COX2_BACSU Exp number of AAs in TMHs: 68.6853700000001
# COX2_BACSU Exp number, first 60 AAs: 39.88783
# COX2_BACSU Total prob of N-in: 0.99962
# COX2_BACSU Reliability score (S3): 1.00
# COX2_BACSU Expected Accuracy: 99%
# COX2_BACSU POSSIBLE N-term signal sequence
# COX2_BACSU Fixed positions: 356(o)
COX2_BACSU TMHMM2.0 inside 1 6
COX2_BACSU TMHMM2.0 TMhelix 7 29
COX2_BACSU TMHMM2.0 outside 30 43
COX2_BACSU TMHMM2.0 TMhelix 44 66
COX2_BACSU TMHMM2.0 inside 67 86
COX2_BACSU TMHMM2.0 TMhelix 87 109
COX2_BACSU TMHMM2.0 outside 110 356
If the whole sequence is labeled as inside or outside, the prediction is
that it contains no membrane helices. It is probably not wise to interpret
it as a prediction of location. The prediction gives the most probable
location and orientation of transmembrane helices in the sequence. It is found
by an algorithm called N-best (or 1-best in this case) that sums over all paths
through the model with the same location and direction of the helices.
The first few lines give some statistics:
- Length: The length of the protein sequence.
- Number of predicted TMHs: The number of predicted transmembrane helices.
- Exp number of AAs in TMHs: The expected number of amino acids in transmembrane helices. If this number is larger than 18, the protein is very likely to be a transmembrane protein (or to have a signal peptide).
- Exp number, first 60 AAs: The expected number of amino acids in transmembrane helices in the first 60 amino acids of the protein. If this number is more than a few, you should be warned that a predicted transmembrane helix in the N-term could be a signal peptide.
- Total prob of N-in: The total probability that the N-term is on the cytoplasmic side of the membrane.
- Reliability score (S3): Measure of how likely the prediction is (not default option), see the Score options part.
- Expected Accuracy: Measure of how trustworthy the prediction is (not default option), see the Score options part.
- POSSIBLE N-term signal sequence: A warning that is produced when "Exp number, first 60 AAs" is larger than 10.
- Fixed positions: The positions that have been fixed during the prediction (o=outside, i=inside, TM=TMhelix).
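For downstream processing, the long output shown above can be split into its statistics and its list of segments. This is a minimal sketch under the assumption that the layout is exactly as in the example: statistics lines start with "#" followed by the sequence identifier, and segment lines have five whitespace-separated fields. The function name `parse_long_output` is hypothetical.

```python
def parse_long_output(text):
    """Split long-format output into (stats, warnings, segments).

    stats    : dict mapping e.g. "Length" to the raw string value
    warnings : list of '#' lines without a "key: value" pair
    segments : list of (region, start, end) with 1-based inclusive positions
    """
    stats, warnings, segments = {}, [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            parts = line[1:].split(None, 1)  # [sequence id, remainder]
            if len(parts) < 2:
                continue
            key, sep, value = parts[1].partition(":")
            if sep:
                stats[key.strip()] = value.strip()
            else:
                warnings.append(parts[1].strip())
        else:
            fields = line.split()
            if len(fields) == 5:
                _seq_id, _source, region, start, end = fields
                segments.append((region, int(start), int(end)))
    return stats, warnings, segments
```

Values are kept as strings so that fields such as "99%" or "356(o)" are not misread; convert them as needed.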
Graphics
The plot shows the posterior probabilities of inside/TM helix/outside. Here one can see possible weak TM helices that were not predicted, and one can get an idea of the certainty of each segment in the prediction.
At the top of the plot (between 1 and 1.2 on the vertical axis) the N-best prediction is shown.
The plot is obtained by calculating the total probability that a residue sits in a helix, on the inside, or on the outside, summed over all possible paths through the model. Sometimes the plot and the prediction seem contradictory, but that is because the plot shows probabilities for each residue, whereas the prediction is the overall most probable structure. The plot should therefore be seen as a complementary source of information.
Below the plot there are links to:
- The plot in encapsulated postscript.
- A script for making the plot in gnuplot.
- The data for the plot.
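Why the per-residue plot and the overall prediction can disagree is easiest to see on a toy case. The sketch below does not use the actual model: it assumes a made-up set of complete labelings of a two-residue sequence with invented probabilities, and the function names are hypothetical. It compares the single most probable labeling with the labeling obtained by picking the most probable state at each residue separately.

```python
from collections import defaultdict


def most_probable_labeling(labelings):
    """Return the complete labeling with the highest probability."""
    return max(labelings, key=labelings.get)


def per_residue_posteriors(labelings):
    """Sum probabilities over all labelings, separately for each residue."""
    length = len(next(iter(labelings)))
    total = sum(labelings.values())
    posteriors = [defaultdict(float) for _ in range(length)]
    for labeling, prob in labelings.items():
        for index, state in enumerate(labeling):
            posteriors[index][state] += prob / total
    return posteriors


def per_residue_argmax(labelings):
    """Pick the most probable state at each residue independently."""
    return "".join(
        max(states, key=states.get) for states in per_residue_posteriors(labelings)
    )


# Invented numbers: 'M' = membrane helix, 'i' = inside.
toy = {"MM": 0.4, "iM": 0.3, "ii": 0.3}
best = most_probable_labeling(toy)   # the overall most probable labeling
pointwise = per_residue_argmax(toy)  # what a per-residue plot would suggest
```

Here the most probable complete labeling is "MM", while the first residue taken alone is more likely to be inside, because two of the three labelings place it there.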
Results without graphics
The same output as above, but without the probability plot.
Results in short version
In the short output format one line is produced for each protein with no graphics. Each line starts with the sequence identifier and then these fields:
- "len=": The length of the protein sequence.
- "ExpAA=": The expected number of amino acids in transmembrane helices (see above).
- "First60=": The expected number of amino acids in transmembrane helices in the first 60 amino acids of the protein (see above).
- "S3score=": The reliability score (not default option).
- "Exp_Acc=": The expected accuracy (not default option).
- "PredHel=": The number of predicted transmembrane helices by N-best.
- "Topology=": The topology predicted by N-best.
For the example above the short output would be:
COX2_BACSU len=356 ExpAA=68.69 First60=39.89 S3score=1.00 Exp_Acc=99% PredHel=3 Topology=i7-29o44-66i87-109o
The topology is given as the positions of the transmembrane helices, separated by 'i' if the loop is on the inside or 'o' if it is on the outside. The example above, 'i7-29o44-66i87-109o', means that the protein starts on the inside, has a predicted TMH at positions 7 to 29, continues on the outside, has a TMH at positions 44 to 66, and so on.
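The short format is convenient for scripting. The following sketch, with hypothetical function names, splits one short-format line into its fields and expands the topology string back into the segment list of the long format. It assumes the fields are whitespace-separated "key=value" pairs, as in the example above.

```python
import re


def parse_short_line(line):
    """Return a dict with the sequence id and every key=value field."""
    fields = line.split()
    record = {"id": fields[0]}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        record[key] = value
    return record


def topology_to_segments(topology, length):
    """Expand e.g. 'i7-29o44-66i87-109o' into (label, start, end) tuples."""
    loops = re.findall(r"[io]", topology)
    helices = [(int(a), int(b)) for a, b in re.findall(r"(\d+)-(\d+)", topology)]
    segments = []
    position = 1
    for side, (start, end) in zip(loops, helices):
        if start > position:
            segments.append((side, position, start - 1))  # loop before helix
        segments.append(("TM", start, end))
        position = end + 1
    if position <= length:
        segments.append((loops[-1], position, length))  # final loop
    return segments
```

For the COX2_BACSU line above, `topology_to_segments(record["Topology"], int(record["len"]))` yields the same seven regions as the long output.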
Score options
The Reliability Score, S3, is a measure of how
likely the predicted topology is compared to all other possible topologies
generated by the model (i.e. p(top)/p(all), where p stands for
probability).
The score can take values between 0 and 1. A predicted topology with an S3 score
close to 0 indicates that there are many other topologies that might be as
likely as the one suggested by the model, and hence the results should be
treated with caution. The opposite applies to S3 scores close to 1: the suggested topology then has a high probability and few other topologies
can compete with it.
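The ratio p(top)/p(all) can be illustrated with a toy calculation. This is only a sketch of the ratio itself, not of how the model computes the probabilities: the function name `reliability_score` is hypothetical and the topology probabilities below are invented for illustration.

```python
def reliability_score(topology_probs):
    """Return (best topology, p(top) / p(all)) for a dict of probabilities."""
    total = sum(topology_probs.values())
    best = max(topology_probs, key=topology_probs.get)
    return best, topology_probs[best] / total


# Invented values: one dominant topology and two weak competitors.
confident = {"i7-29o44-66i87-109o": 0.90, "o7-29i44-66o": 0.06, "i7-29o": 0.04}

# Invented values: several topologies that are almost equally likely.
ambiguous = {"i7-29o": 0.35, "o7-29i": 0.33, "i7-29o44-66i": 0.32}

best_confident, score_confident = reliability_score(confident)
best_ambiguous, score_ambiguous = reliability_score(ambiguous)
```

In the first case the score is close to 1, in the second it is close to 0.35, which reflects that two other topologies are nearly as likely as the best one.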
The Expected Accuracy Score (given in percent) is proportional to the S3
score and estimates how probable it is that the suggested topology is correct.
For a more detailed description we refer to the paper mentioned in the beginning of this document.
Constrained prediction
If one has any prior knowledge (or a hypothesis) about the
location of any part of a protein, it can be used to constrain the prediction.
The benefit is that it will reduce the number of possible topologies since
all topologies that contradict the constraints are discarded. Therefore
the chance of getting the correct prediction is increased.
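The effect of discarding contradicting topologies can be sketched as a simple filter. The code below is an illustration only: the server applies the constraints inside the model during prediction, whereas this sketch filters a hand-written list of candidate topology strings afterwards. The function names and the candidate list are hypothetical.

```python
import re


def label_at(topology, position):
    """Return 'i', 'o' or 'TM' for a 1-based residue position."""
    loops = re.findall(r"[io]", topology)
    helices = [(int(a), int(b)) for a, b in re.findall(r"(\d+)-(\d+)", topology)]
    for index, (start, end) in enumerate(helices):
        if position < start:
            return loops[index]  # in the loop preceding this helix
        if position <= end:
            return "TM"
    return loops[-1]  # after the last helix


def satisfies(topology, constraints):
    """True if every fixed position has the required label."""
    return all(label_at(topology, pos) == label for pos, label in constraints.items())


def filter_topologies(candidates, constraints):
    """Keep only the candidates that do not contradict the constraints."""
    return [t for t in candidates if satisfies(t, constraints)]


# Hypothetical candidates for a 356-residue protein; C-terminus fixed outside.
candidates = ["i7-29o44-66i87-109o", "o7-29i44-66o87-109i", "i7-29o44-66i"]
remaining = filter_topologies(candidates, {356: "o"})
```

Only the first candidate places residue 356 on the outside, so the other two are removed from consideration.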
One or more of the following fixation alternatives are available:
- N-terminal: Inside (cytoplasmic) or Outside (non-cytoplasmic)
- Any position: Inside (cytoplasmic), Transmembrane or Outside (non-cytoplasmic)
- C-terminal: Inside (cytoplasmic) or Outside (non-cytoplasmic)
For positions that are not at the start (N-terminal) or end (C-terminal) of a protein, the user can choose how long the fixed region should be. Enter the residue number that starts the region in the "Start pos:" box and the residue number that ends the region in the "End pos:" box. To fix a single residue, enter the same number in both the "Start pos:" and "End pos:" boxes.
Here is an example where position 115 is fixed to an inside loop and the C-terminus is fixed to the outside:
Constraints can only be set for one protein at a time. If several proteins are pasted (or uploaded via a file), only the first protein will be predicted with the specified constraints.