SBC-WCN workshop on protein structure/function prediction and comparative genomics
To be held at Albanova University Center, Stockholm, on May 28-29. All
presentations in lecture hall FD5.
Saturday
Sunday
Abstracts
Evidence for widespread reticulate evolution within human
duplicons
Approximately 5% of the human genome consists of segmental duplications
which can cause genomic mutations and may play a role in gene
innovation. Reticulate evolutionary processes such as unequal crossing
over and gene conversion are known to occur within specific duplicon
families, but the broader contribution of these processes to the
evolution of human duplications remains poorly characterised. We have
used phylogenetic profiling to analyse multiple alignments of 24 human
duplicon families which span >8Mb of DNA and find that none are evolving
independently; all alignments show sharp discontinuities in phylogenetic
signal. To analyse these in more detail we have developed a quartet
method which estimates the relative contribution of nucleotide
substitution and reticulation within these sequences. Most duplications
show a highly significant excess of sites consistent with reticulation
compared to the number expected by nucleotide substitution alone, with
15 out of 30 alignments showing a >20X excess over expectation. Using
runs tests we also show that at least 5% of the total sequence shares
100% sequence identity due to reticulation, a figure which includes 74
independent tracts of perfect identity >2kb in length. Furthermore,
analysis of a subset of alignments indicates that the density of
reticulation events is as high as 1 every 4kb. These results have
important implications for efforts to finish the human genome sequence,
complicate comparative sequence analysis of duplicon families, and could
profoundly influence the tempo of gene family evolution.
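The abstract does not spell out the quartet method, so the following is only a minimal sketch of the general idea behind quartet-based analyses, not the authors' procedure. For four aligned sequences, each site that groups the sequences two against two supports one of three possible splits; under purely tree-like evolution one split should dominate, whereas a switch of the supported split along the alignment is the kind of discontinuity that suggests reticulation. The function name and the toy sequences are hypothetical.

```python
from collections import Counter

def quartet_site_support(a, b, c, d):
    """Return (column, split) for every site that groups the four
    sequences two against two; all other columns are ignored."""
    support = []
    for pos, (w, x, y, z) in enumerate(zip(a, b, c, d)):
        if "-" in (w, x, y, z):
            continue  # skip gapped columns
        if w == x and y == z and w != y:
            support.append((pos, "AB|CD"))
        elif w == y and x == z and w != x:
            support.append((pos, "AC|BD"))
        elif w == z and x == y and w != x:
            support.append((pos, "AD|BC"))
    return support

# Made-up toy alignment: the left part groups A with B,
# the right part groups A with C.
seq_a = "ACGTACGTAC"
seq_b = "ACGTACGTTT"
seq_c = "TTGAACGTAC"
seq_d = "TTGAACGTTT"

support = quartet_site_support(seq_a, seq_b, seq_c, seq_d)
counts = Counter(split for _, split in support)
print(support)
print(counts)
```

A real analysis would compare such counts with the number of conflicting sites expected from nucleotide substitution alone; that expectation is not modelled here.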
Fast algorithm for calculating the likelihood of
a tree in a gene gain-loss-duplication model
The evolution of gene content is an intensively investigated problem in
genomics. To the best of our knowledge, standard Markov models for gene content
evolution used so far are only presence/absence models that neglect
information about copy numbers of genes in genomes.
We propose a three-parameter continuous-time Markov model for gene content
evolution, with one parameter each for gain (horizontal transfer), loss and
duplication, which can handle arbitrary copy numbers.
We show that transition probabilities in this model can be calculated
analytically; moreover, the likelihood of a tree given copy numbers of a
gene at the leaves and evolutionary parameters can be calculated in
polynomial time.
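As a rough illustration of what such a likelihood computation involves, here is a minimal numerical sketch. It is not the analytic method announced in the abstract: it truncates copy numbers at a hypothetical `max_copies`, approximates the transition matrix by a scaled Taylor series, and combines branches with the standard pruning recursion. The rate parametrisation (constant gain rate, loss and duplication rates proportional to copy number), the tree encoding and the uniform root prior are all assumptions made for the example.

```python
def rate_matrix(gain, loss, dup, max_copies):
    """Rate matrix on copy numbers 0..max_copies (assumed rates)."""
    n = max_copies + 1
    q = [[0.0] * n for _ in range(n)]
    for k in range(n):
        if k < max_copies:
            q[k][k + 1] = gain + dup * k   # gain or duplication
        if k > 0:
            q[k][k - 1] = loss * k         # loss of one copy
        q[k][k] = -sum(q[k])               # rows sum to zero
    return q

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(q, t, squarings=10, terms=12):
    """Approximate exp(q * t) by scaling and squaring a Taylor series."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for m in range(1, terms + 1):
        term = [[x / m for x in row] for row in matmul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result

def partial_likelihoods(node, q):
    """Pruning recursion. A leaf is an int (observed copy number);
    an internal node is a list of (child, branch_length) pairs."""
    n = len(q)
    if isinstance(node, int):
        return [1.0 if k == node else 0.0 for k in range(n)]
    vec = [1.0] * n
    for child, length in node:
        p = transition_matrix(q, length)
        below = partial_likelihoods(child, q)
        for k in range(n):
            vec[k] *= sum(p[k][j] * below[j] for j in range(n))
    return vec

def tree_likelihood(tree, q, root_prior):
    vec = partial_likelihoods(tree, q)
    return sum(pi * l for pi, l in zip(root_prior, vec))

# Toy example: three genomes with 1, 2 and 0 copies of a gene.
q = rate_matrix(gain=0.1, loss=0.5, dup=0.2, max_copies=8)
tree = [([(1, 0.3), (2, 0.3)], 0.2), (0, 0.5)]
prior = [1.0 / len(q)] * len(q)
likelihood = tree_likelihood(tree, q, prior)
print(likelihood)
```

The truncation is the main simplification: the model in the abstract handles arbitrary copy numbers, while this sketch ignores any history that passes through more than `max_copies` copies.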
A tree-based Gibbs motif sampler for unaligned orthologous upstream sequences
In phylogenetic footprinting, putative regulatory elements are found
in upstream regions of orthologous genes by searching for common
motifs. Gibbs sampling is one successful method for finding common
motifs. Since the orthologous sequences are related by a tree, and
differences between motif instances in different upstream regions
are caused by mutational events along its edges, taking advantage of
the tree in the motif search is an obvious and appealing idea.
We describe the Tree-Based Gibbs motif sampler, a Gibbs sampler
based on a general tree that takes unaligned sequences as
input. An implementation of the tree-based sampler will be described,
as well as in silico experimental results that show clear advantages
of the method.
Phylogenetic Networks
Phylogenetic trees are used to describe the evolution of a single gene under a model of evolution involving only mutation and speciation events. When studying the evolution of more than one gene, or under more realistic models of evolution,
phylogenetic trees often do not suffice and more general phylogenetic networks must be employed.
We give an introduction to a number of different types of phylogenetic networks, including splits networks,
hybridization networks and recombination networks. Further, we discuss a number of algorithms that compute such networks and demonstrate their use on some examples.
Minimising frustration in protein sequence analysis
Genomics and other sequencing efforts are producing a flood of
protein sequence data. Correct evolutionary classification and the
identification of subtle sequence motifs are keys to deciphering the
function and structure of these proteins. Profile models are a
powerful and widely used tool to detect relations among protein
sequences. Profiles measure the distance of a sequence to the centre
of the family. This works very well at relatively short distances,
but breaks down when dealing with remote homologues. Unfortunately,
structure comparisons have revealed many families that form elongated
clusters in "sequence space" and no single profile model can detect
all members of such families. Our approach defines families using
profiles at close range where they perform reliably and then
propagates the assignment of homology to nearest neighbours. Using
the simple principle of minimal frustration across cluster
boundaries, we can detect significantly more remote homologues than
the most advanced profile-profile (HMM-HMM) comparison methods. The
improved detection of remote homologues is ascribed to accurate
detection of sparse sequence signatures.
Reference
Heger A, Lappe M, Holm L (2004) Accurate detection of very sparse
sequence motifs. J Comp Biol 11, 843-857.
Protein sequence to structure alignment and other fantasies
Our most recent protein sequence to structure alignment methods
have been based on a mixture of rigorous statistics, numerical
optimisation, wild empiricism and several acts of faith.
On the one side, we assume that short patterns in sequences will
be correlated with patterns in structure. We then hope that we
can capture these relationships with unsupervised Bayesian
classifiers. If all goes well, we can make the claim of having
true probabilistic measures of sequence to structure
compatibility functions.
Because we do not believe our own results, we have built much in
the way of numerical optimisation machinery which allows
arbitrary parameters to be adjusted so as to produce very good
sequence to structure alignments. It is not too difficult to
squeeze various kinds of sequence comparison terms into this
framework. The result is that individual components are hard to
justify, but fortunately, they are hard to criticise. What might
appear to be a dog's breakfast of ideas is glued into a properly
balanced diet of numerical components optimised to work with each
other.
The ring of life provides evidence for a genome fusion origin of eukaryotes.
Genomes hold within them the record of the evolution of life on Earth. But
genome fusions and horizontal gene transfer appear to have sufficiently
obscured the gene sequence record such that it is difficult to reconstruct
the tree of life. Here we determine the general outline of the tree using
complete genome data from representative prokaryotes and eukaryotes and a
new genome analysis method that makes it possible to reconstruct ancient
genome fusions and phylogenetic trees. Our analyses indicate that the
eukaryotic genome resulted from a fusion of two diverse prokaryotic genomes,
and therefore at the deepest levels linking prokaryotes and eukaryotes, the
tree of life is actually a ring of life. One fusion partner branches from
deep within an ancient photosynthetic clade, and the other is related to the
archaeal prokaryotes. The eubacterial organism is either a Proteobacterium,
or a member of a larger photosynthetic clade that includes the Cyanobacteria
and the Proteobacteria.
The structure, dynamics and mechanism of water permeation and proton exclusion in aquaporin water channels
'Real time' molecular dynamics simulations of water permeation through the
pores of both aquaporin-1 and the homologous bacterial glycerol facilitator
GlpF are presented, from which a time-resolved, atomic-resolution model of
the permeation mechanism across these highly selective membrane channels
was obtained. Both proteins act as two-stage filters: conserved fingerprint
(Asparagine-Proline-Alanine, NPA) motifs together with a second
('aromatic/Arginine') region jointly enable the selective, yet efficient
permeation of water and linear alcohols, respectively [1].
A particularly intriguing and longstanding puzzle in aquaporin
research has been the ability of these proteins to prevent a proton
flux across their pores, because water and other aqueous pores
efficiently conduct protons, via the so-called Grotthuss mechanism.
How proton exclusion is reconciled with the seemingly contradictory
task of efficient water permeation has been addressed by proton transfer
simulations, which show that a strong electrostatic field across the pore
is likely to be the main determinant of proton exclusion [2].
References
[1] Bert L. de Groot and Helmut Grubmueller; Water Permeation Across
Biological Membranes: Mechanism and Dynamics of Aquaporin-1 and GlpF.
Science 294:2353-2357 (2001).
[2] Bert L. de Groot, Tomaso Frigato, Volkhard Helms and Helmut Grubmueller;
The mechanism of proton exclusion in the aquaporin-1 water channel.
J. Mol. Biol. 333:279-293 (2003).
A Domain Interaction Map Based on Phylogenetic
Profiling
Phylogenetic profiling is a well established method for predicting
functional relations and physical interactions between proteins. We present
a new method for finding such relations based on phylogenetic profiling of
conserved domains rather than proteins, avoiding computationally
expensive all versus all sequence comparisons among genomes. The
resulting domain interaction map (DIMA) can be explored directly or
mapped to a genome of interest. We demonstrate that the performance of
DIMA is comparable to that of classical phylogenetic profiling and its
predictions often yield information that cannot be detected by profiling of
entire protein chains. A comprehensive DIMA Web-resource will also be
presented.
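To make the idea of domain-level phylogenetic profiling concrete, here is a minimal sketch. Each domain gets a presence/absence vector over a set of genomes, and pairs of domains with similar vectors are reported as candidate functional links. The Jaccard score, the threshold and all genome and domain names are assumptions chosen for illustration; the abstract does not say how DIMA scores profile similarity.

```python
from itertools import combinations

def build_profiles(genome_domains):
    """Map each domain to a 0/1 tuple over the sorted genome names."""
    genomes = sorted(genome_domains)
    domains = sorted(set().union(*genome_domains.values()))
    return {d: tuple(int(d in genome_domains[g]) for g in genomes)
            for d in domains}

def jaccard(p, q):
    """Shared presences divided by presences in either profile."""
    both = sum(1 for x, y in zip(p, q) if x and y)
    either = sum(1 for x, y in zip(p, q) if x or y)
    return both / either if either else 0.0

def predict_links(profiles, threshold=0.75):
    """Return domain pairs whose profiles score at or above threshold."""
    links = []
    for d1, d2 in combinations(sorted(profiles), 2):
        score = jaccard(profiles[d1], profiles[d2])
        if score >= threshold:
            links.append((d1, d2, score))
    return links

# Hypothetical domain content of four genomes.
genome_domains = {
    "genome_1": {"domA", "domB", "domC"},
    "genome_2": {"domA", "domB"},
    "genome_3": {"domC"},
    "genome_4": {"domA", "domB", "domD"},
}

profiles = build_profiles(genome_domains)
links = predict_links(profiles)
print(links)
```

Because the profiles are built from domain annotations per genome, no all-versus-all sequence comparison between genomes is needed at this stage, which is the saving the abstract points to.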
Experimentally-based topology predictions for membrane proteomes
In the absence of a high-resolution three-dimensional structure, an
important cornerstone for the functional analysis of any membrane
protein is an accurate topology model. A topology model describes the
number of transmembrane spans and the orientation of the protein
relative to the lipid bilayer. We have previously shown that topology
prediction algorithms can be greatly improved by constraining them
with an experimentally determined reference point, such as the
location of a protein's C-terminus. Such reference points can be
obtained most easily through the use of topology reporter
proteins. Recently we have mapped the C-terminal location for two
membrane proteomes: the E. coli inner membrane proteome and the
S. cerevisiae endoplasmic reticulum proteome. This experimental data
can be transferred to other membrane proteomes by homology, enabling
us to derive experimentally-based topologies en masse.
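The effect of a C-terminal reference point can be illustrated with a small sketch. It assumes a predictor has already produced several scored candidate topologies (the candidates, scores and function names below are hypothetical) and simply discards those that disagree with the measured C-terminal location, using the fact that each transmembrane span flips the side of the membrane. Actual constrained predictors typically impose the constraint inside the prediction model rather than filtering afterwards.

```python
def c_terminus_side(n_term_side, n_helices):
    """Side of the C-terminus implied by the N-terminal side and the
    number of transmembrane spans (each span crosses the membrane)."""
    if n_helices % 2 == 0:
        return n_term_side
    return "out" if n_term_side == "in" else "in"

def best_constrained_topology(candidates, observed_c_term):
    """Highest-scoring candidate that agrees with the experiment,
    or None if no candidate is consistent."""
    consistent = [c for c in candidates
                  if c_terminus_side(c["n_term"], c["helices"]) == observed_c_term]
    return max(consistent, key=lambda c: c["score"]) if consistent else None

# Hypothetical predictor output for one protein.
candidates = [
    {"helices": 5, "n_term": "in", "score": 0.52},   # implies C-term out
    {"helices": 6, "n_term": "in", "score": 0.40},   # implies C-term in
    {"helices": 4, "n_term": "out", "score": 0.08},  # implies C-term out
]

best = best_constrained_topology(candidates, observed_c_term="in")
print(best)
```

In this toy case the unconstrained top-scoring model (5 spans) is rejected, and the experimental reference point selects the 6-span model instead.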
Lars Arvestad
Last modified: Fri May 27 15:09:59 CEST 2005