|
In a standard
substitution matrix, such as BLOSUM62, the substitution of one amino acid
with another is associated with a single score. Database searches can
be customized to find specific protein families or domains using substitution
scores that reflect the substitution frequencies of each individual amino
acid position in a domain. Position-specific scoring matrix or HMMs profiles,
in its simplest form, may consist of a set of 20 substitution scores at
each position along the motif - one for each of the amino acids. Amino
acids that are commonly found at a particular position receive higher
scores and amino acids unlikely to appear at that position receive lower
scores.
The HMM is
used to statistically describe a protein family's consensus sequence.
This statistical description can be used for sensitive and selective database
searching.
The model consists of a linear sequence of nodes with a begin state (B)
and an end state (E). Each node in an HMM has a match state (M), insert
state (I) and delete state (D) with position-specific probabilities for
transitioning into each of these states from the previous node.
In addition to a transition probability, the match state also has position-specific
probabilities for seeing a particular residue. Similarly, the insert state
has probabilities for inserting a residue at the position given by the
node. There is also a chance that no residue is associated with a node.
That probability is indicated by the probability of transitioning to the
delete state.
Both transition and emission probabilities can be generated from a multiple
alignment of a family of sequences.
An HMM can
be compared (that is, aligned) with a new sequence to determine the probability
that the sequence belongs to the modelled family. The most probable path
through the HMM (i.e., which transitions were taken and which residues
were emitted at match and insert states) taken to generate a sequence
similar to the new sequence determines the similarity score.
Unlike traditional
pair-wise alignment algorithms such as BLAST, FastA, and Smith-Waterman,
HMMs use position-specific scoring for the matching or substitution of
a residue and for the opening or extension of a gap. Position-specific
scoring allows HMMs to characterize entire families of sequences by modelling
the extent to which the regions should be conserved in a multiple alignment.
As a result, distantly related proteins can be found even with low sequence
identity, if the similarities and differences are common to the family
members. This type of analysis is quite powerful because the function
of divergent proteins is conserved through evolution even though sequence
elements are free to change in some areas.
References:
Andreas D. Baxevanis, B. F. Francis Ouellette. " Bioinformatics -
A practical Guide to the Analysis of Genes and Proteins, Wiley-Interscience
Paracel
GeneMatcher documentation
Classic papers and books:
Baldi, P., Y. Chauvin, T. Hunkapillar, and M.A. McClure. "Hidden
Markov models of biological primary sequence information," Proc.
Natl. Acad. Sci., Vol. 91, pp. 1059-1063, 1994.
Durbin, R., S. Eddy, A. Krogh, G. Mitchison. Biological Sequence Analysis.
Cambridge U. Press, 1998.
Krogh,
A., M. Brown, I.S. Mian, K. Sjolander, and D. Haussler. "Hidden Markov
Models in Computational Biology: Applications to Protein Modeling,"
J. Mol. Biol., Vol. 235, pp.1501-1531, 1994.
|