SBC-WCN workshop on protein structure/function prediction and comparative genomics

To be held at Albanova University Center, Stockholm, on May 28-29. All presentations in lecture hall FD5.

Saturday

09.30  Michael Jackson, Newcastle
       Evidence for widespread reticulate evolution within human duplicons
10.20  Coffee
10.50  Istvan Miklos, Loránd Eötvös University
       Fast algorithm for calculating the likelihood of a tree in a gene gain-loss-duplication model
11.40  Jens Lagergren, Stockholm Bioinformatics Center
       A tree-based Gibbs motif sampler for unaligned orthologous upstream sequences
12.30  Lunch
14.00  Daniel Huson, Tübingen
       Phylogenetic Networks
14.50  Coffee
15.20  Jim Lake, UCLA
       The ring of life provides evidence for a genome fusion origin of eukaryotes
16.10  Bert de Groot, Max Planck Institute for Biophysical Chemistry
       The structure, dynamics and mechanism of water permeation and proton exclusion in aquaporin water channels

Sunday

10.20  Coffee
10.50  Dmitrij Frishman, Technische Universität München
       A Domain Interaction Map Based on Phylogenetic Profiling
11.40  Mark Johnson, Åbo
12.30  Lunch
14.00  Liisa Holm, Helsinki
       Minimising frustration in protein sequence analysis
14.50  Coffee
15.20  Andrew Torda, Hamburg
       Protein sequence to structure alignment and other fantasies
16.10  Dan Daley, Stockholm University
       Experimentally-based topology predictions for membrane proteomes

Abstracts

Michael Jackson

Evidence for widespread reticulate evolution within human duplicons

Approximately 5% of the human genome consists of segmental duplications, which can cause genomic mutations and may play a role in gene innovation. Reticulate evolutionary processes such as unequal crossing over and gene conversion are known to occur within specific duplicon families, but the broader contribution of these processes to the evolution of human duplications remains poorly characterised. We have used phylogenetic profiling to analyse multiple alignments of 24 human duplicon families which span >8 Mb of DNA and find that none are evolving independently; all alignments show sharp discontinuities in phylogenetic signal. To analyse these in more detail, we have developed a quartet method which estimates the relative contribution of nucleotide substitution and reticulation within these sequences. Most duplications show a highly significant excess of sites consistent with reticulation compared to the number expected by nucleotide substitution alone, with 15 out of 30 alignments showing a >20-fold excess over expectation. Using runs tests, we also show that at least 5% of the total sequence shares 100% sequence identity due to reticulation, a figure which includes 74 independent tracts of perfect identity >2 kb in length. Furthermore, analysis of a subset of alignments indicates that the density of reticulation events is as high as 1 every 4 kb. These results have important implications for efforts to finish the human genome sequence, complicate comparative sequence analysis of duplicon families, and could profoundly influence the tempo of gene family evolution.

Istvan Miklos

Fast algorithm for calculating the likelihood of a tree in a gene gain-loss-duplication model

The evolution of gene content is an intensively investigated problem in genomics. To the best of our knowledge, the standard Markov models for gene content evolution used so far are presence/absence models that neglect information about the copy numbers of genes in genomes. We propose a three-parameter continuous-time Markov model for gene content evolution, with one parameter each for gain (horizontal transfer), loss and duplication, which can handle arbitrary copy numbers. We show that transition probabilities in this model can be calculated analytically; moreover, the likelihood of a tree, given the copy numbers of a gene at the leaves and the evolutionary parameters, can be calculated in polynomial time.
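
As a rough illustration of this kind of model, the sketch below computes the likelihood of a small tree numerically. It is not the analytical, polynomial-time algorithm of the talk: it assumes a birth-death-immigration parameterisation (gain at a constant rate, loss and duplication proportional to copy number), truncates the copy number at a hypothetical `max_copies`, obtains transition probabilities by uniformization, and uses a uniform root prior. All function names, parameter values and the tree encoding are assumptions made for the example.

```python
import math

def rate_matrix(gain, loss, dup, max_copies):
    """Rate matrix on copy numbers 0..max_copies (truncation is an assumption)."""
    size = max_copies + 1
    q = [[0.0] * size for _ in range(size)]
    for n in range(size):
        if n < max_copies:
            q[n][n + 1] = gain + n * dup   # gain or duplication adds a copy
        if n > 0:
            q[n][n - 1] = n * loss         # each copy is lost independently
        q[n][n] = -sum(q[n])               # rows of a rate matrix sum to zero
    return q

def matmul(a, b):
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def transition_matrix(q, t, terms=200):
    """P(t) = exp(Qt) by uniformization; adequate while mu * t stays small."""
    size = len(q)
    mu = max(-q[i][i] for i in range(size)) or 1.0
    r = [[(1.0 if i == j else 0.0) + q[i][j] / mu for j in range(size)]
         for i in range(size)]
    p = [[0.0] * size for _ in range(size)]
    power = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    weight = math.exp(-mu * t)             # Poisson weight for k = 0
    for k in range(terms):
        for i in range(size):
            for j in range(size):
                p[i][j] += weight * power[i][j]
        power = matmul(power, r)
        weight *= mu * t / (k + 1)
    return p

def conditional_likelihoods(node, q):
    """Felsenstein pruning. A leaf is an int copy number; an internal node is
    a tuple (left, branch_length_left, right, branch_length_right)."""
    size = len(q)
    if isinstance(node, int):
        return [1.0 if s == node else 0.0 for s in range(size)]
    left, t_left, right, t_right = node
    result = [1.0] * size
    for child, t in ((left, t_left), (right, t_right)):
        child_like = conditional_likelihoods(child, q)
        p = transition_matrix(q, t)
        for s in range(size):
            result[s] *= sum(p[s][c] * child_like[c] for c in range(size))
    return result

def tree_likelihood(tree, gain, loss, dup, max_copies):
    q = rate_matrix(gain, loss, dup, max_copies)
    cond = conditional_likelihoods(tree, q)
    prior = 1.0 / len(cond)                # uniform root prior (assumption)
    return sum(prior * value for value in cond)

# Three genomes carrying 1, 2 and 0 copies of a gene family.
example_tree = ((1, 0.3, 2, 0.3), 0.5, 0, 0.8)
print(tree_likelihood(example_tree, gain=0.1, loss=0.5, dup=0.3, max_copies=8))
```

The truncation and the numerical matrix exponential are exactly the parts that an analytical treatment of the transition probabilities would make unnecessary.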

Jens Lagergren

A tree-based Gibbs motif sampler for unaligned orthologous upstream sequences

In phylogenetic footprinting, putative regulatory elements are found in upstream regions of orthologous genes by searching for common motifs. Gibbs sampling is one successful method for finding common motifs. Since the orthologous sequences are related by a tree, and differences between motif instances in different upstream regions are caused by mutational events along its edges, taking advantage of the tree in the motif search is an obvious and appealing idea. We describe the Tree-Based Gibbs motif sampler, a Gibbs sampler that is based on a general tree and takes unaligned sequences as input. An implementation of the tree-based sampler will be described, as well as in silico experimental results that show clear advantages of the method.

Daniel Huson

Phylogenetic Networks

Phylogenetic trees are used to describe the evolution of a single gene under a model of evolution involving only mutation and speciation events. When studying the evolution of more than one gene, or under more realistic models of evolution, phylogenetic trees often do not suffice and more general phylogenetic networks must be employed. We give an introduction to a number of different types of phylogenetic networks, including splits networks, hybridization networks and recombination networks. Further, we discuss a number of algorithms that compute such networks and demonstrate their use on some examples.

Liisa Holm

Minimising frustration in protein sequence analysis

Genomics and other sequencing efforts are producing a flood of protein sequence data. Correct evolutionary classification and the identification of subtle sequence motifs are keys to deciphering the function and structure of these proteins. Profile models are a powerful and widely used tool to detect relations among protein sequences. Profiles measure the distance of a sequence to the centre of the family. This works very well at relatively short distances, but breaks down when dealing with remote homologues. Unfortunately, structure comparisons have revealed many families that form elongated clusters in "sequence space" and no single profile model can detect all members of such families. Our approach defines families using profiles at close range where they perform reliably and then propagates the assignment of homology to nearest neighbours. Using the simple principle of minimal frustration across cluster boundaries, we can detect significantly more remote homologues than the most advanced profile-profile (HMM-HMM) comparison methods. The improved detection of remote homologues is ascribed to accurate detection of sparse sequence signatures.
Reference
Heger A, Lappe M, Holm L (2004) Accurate detection of very sparse sequence motifs. J Comp Biol 11, 843-857.
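
The idea of trusting only close-range relations and then propagating family membership to nearest neighbours can be sketched as follows. This is a minimal illustration, not the method of the talk: the minimal-frustration criterion is not reproduced, and the edge list, distance scale, `cutoff` and sequence names are all hypothetical.

```python
def propagate_families(edges, seeds, cutoff):
    """Grow family assignments outwards from seed sequences.

    edges  -- iterable of (seq_a, seq_b, distance) tuples
    seeds  -- dict mapping a sequence to its known family
    cutoff -- largest distance at which a relation is trusted (assumption)
    """
    assigned = dict(seeds)
    while True:
        # Trusted edges that join an assigned sequence to an unassigned one.
        frontier = [
            (dist, a, b) for a, b, dist in edges
            if dist <= cutoff and (a in assigned) != (b in assigned)
        ]
        if not frontier:
            return assigned
        # Follow the shortest such edge first (nearest neighbour).
        dist, a, b = min(frontier)
        if a in assigned:
            assigned[b] = assigned[a]
        else:
            assigned[a] = assigned[b]

# An elongated family: s1 and s4 are far apart but linked by a chain.
edges = [
    ("s1", "s2", 0.2), ("s2", "s3", 0.2), ("s3", "s4", 0.2),
    ("s1", "s4", 0.9), ("s4", "s5", 0.8),
]
print(propagate_families(edges, {"s1": "famA"}, cutoff=0.3))
```

In this toy data `s4` is too distant from `s1` to be linked directly, yet it is reached through `s2` and `s3`, while `s5` stays unassigned because its only link exceeds the cutoff.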

Andrew Torda

Protein sequence to structure alignment and other fantasies

Our most recent protein sequence to structure alignment methods have been based on a mixture of rigorous statistics, numerical optimisation, wild empiricism and several acts of faith. On the one side, we assume that short patterns in sequences will be correlated with patterns in structure. We then hope that we can capture these relationships with unsupervised Bayesian classifiers. If all goes well, we can make the claim of having true probabilistic measures of sequence to structure compatibility functions. Because we do not believe our own results, we have built much in the way of numerical optimisation machinery which allows arbitrary parameters to be adjusted so as to produce very good sequence to structure alignments. It is not too difficult to squeeze various kinds of sequence comparison terms into this framework. The result is that individual components are hard to justify, but fortunately, they are hard to criticise. What might appear to be a dog's breakfast of ideas is glued into a properly balanced diet of numerical components optimised to work with each other.

Jim Lake

The ring of life provides evidence for a genome fusion origin of eukaryotes

Genomes hold within them the record of the evolution of life on Earth. But genome fusions and horizontal gene transfer appear to have sufficiently obscured the gene sequence record such that it is difficult to reconstruct the tree of life. Here we determine the general outline of the tree using complete genome data from representative prokaryotes and eukaryotes and a new genome analysis method that makes it possible to reconstruct ancient genome fusions and phylogenetic trees. Our analyses indicate that the eukaryotic genome resulted from a fusion of two diverse prokaryotic genomes, and therefore at the deepest levels linking prokaryotes and eukaryotes, the tree of life is actually a ring of life. One fusion partner branches from deep within an ancient photosynthetic clade, and the other is related to the archaeal prokaryotes. The eubacterial organism is either a Proteobacterium, or a member of a larger photosynthetic clade that includes the Cyanobacteria and the Proteobacteria.

Bert de Groot

The structure, dynamics and mechanism of water permeation and proton exclusion in aquaporin water channels

'Real time' molecular dynamics simulations of water permeation through the pores of both aquaporin-1 and the homologous bacterial glycerol facilitator GlpF are presented, from which a time-resolved, atomic-resolution model of the permeation mechanism across these highly selective membrane channels was obtained. Both proteins act as two-stage filters: conserved fingerprint (Asparagine-Proline-Alanine, NPA) motifs together with a second ('aromatic/Arginine') region jointly enable the selective, yet efficient permeation of water and linear alcohols, respectively [1].

A particularly intriguing and longstanding puzzle in aquaporin research has been the ability of these proteins to prevent a proton flux across their pores, because water and other aqueous pores efficiently conduct protons via the so-called Grotthuss mechanism. How proton exclusion is reconciled with the seemingly contradictory task of efficient water permeation has been addressed by proton transfer simulations, which show that a strong electrostatic field across the pore is likely to be the main determinant of proton exclusion [2].

References
  1. Bert L. de Groot and Helmut Grubmüller; Water Permeation Across Biological Membranes: Mechanism and Dynamics of Aquaporin-1 and GlpF. Science 294: 2353-2357 (2001).
  2. Bert L. de Groot, Tomaso Frigato, Volkhard Helms and Helmut Grubmüller; The mechanism of proton exclusion in the aquaporin-1 water channel. J. Mol. Biol. 333: 279-293 (2003).

Dmitrij Frishman

A Domain Interaction Map Based on Phylogenetic Profiling

Phylogenetic profiling is a well-established method for predicting functional relations and physical interactions between proteins. We present a new method for finding such relations based on phylogenetic profiling of conserved domains rather than proteins, avoiding computationally expensive all-versus-all sequence comparisons among genomes. The resulting domain interaction map (DIMA) can be explored directly or mapped to a genome of interest. We demonstrate that the performance of DIMA is comparable to that of classical phylogenetic profiling and its predictions often yield information that cannot be detected by profiling of entire protein chains. A comprehensive DIMA Web-resource will also be presented.
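
To make the idea of domain-level phylogenetic profiling concrete, here is a minimal sketch. It represents each domain by the set of genomes in which it occurs and links domains whose profiles are similar. The Jaccard index, the `min_similarity` threshold, and the domain and genome names are assumptions chosen for illustration; the abstract does not state how DIMA scores profile similarity.

```python
from itertools import combinations

def jaccard(profile_a, profile_b):
    """Similarity of two profiles, each a set of genomes containing the domain."""
    union = profile_a | profile_b
    return len(profile_a & profile_b) / len(union) if union else 0.0

def domain_interaction_map(domain_genomes, min_similarity=0.75):
    """Return (domain_1, domain_2, score) for domain pairs with similar profiles."""
    links = []
    for d1, d2 in combinations(sorted(domain_genomes), 2):
        score = jaccard(domain_genomes[d1], domain_genomes[d2])
        if score >= min_similarity:
            links.append((d1, d2, score))
    # Strongest predicted relations first.
    return sorted(links, key=lambda link: -link[2])

# Hypothetical occurrence data: domain -> genomes in which it is found.
profiles = {
    "DomA": {"g1", "g2", "g4"},
    "DomB": {"g1", "g2", "g4"},
    "DomC": {"g2", "g3", "g5"},
    "DomD": {"g1", "g2", "g4", "g5"},
}
for link in domain_interaction_map(profiles):
    print(link)
```

Because the profiles are built from domain occurrences per genome, no all-versus-all sequence comparison between genomes is needed at this stage.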

Dan Daley

Experimentally-based topology predictions for membrane proteomes

In the absence of a high-resolution three-dimensional structure, an important cornerstone for the functional analysis of any membrane protein is an accurate topology model. A topology model describes the number of transmembrane spans and the orientation of the protein relative to the lipid bilayer. We have previously shown that topology prediction algorithms can be greatly improved by constraining them with an experimentally determined reference point, such as the location of a protein's C-terminus. Such reference points can be obtained most easily through the use of topology reporter proteins. Recently we have mapped the C-terminal location for two membrane proteomes: the E. coli inner membrane proteome and the S. cerevisiae endoplasmic reticulum proteome. These experimental data can be transferred to other membrane proteomes by homology, enabling us to derive experimentally-based topologies en masse.
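
How a C-terminal reference point constrains a prediction can be sketched in a few lines. The example assumes that a predictor returns several scored candidate topologies, each given as the N-terminal location and the number of transmembrane spans; the candidate list, the scores and the "in"/"out" labels are hypothetical. The only rule used is that each transmembrane span moves the chain to the other side of the membrane.

```python
def c_terminal_side(n_terminal_side, n_spans):
    """Side of the membrane on which the C-terminus ends up.

    An even number of spans leaves the chain on the side where it started;
    an odd number puts it on the opposite side.
    """
    if n_spans % 2 == 0:
        return n_terminal_side
    return "out" if n_terminal_side == "in" else "in"

def best_constrained_topology(candidates, observed_c_terminal_side):
    """Highest-scoring candidate that agrees with the measured C-terminus.

    candidates -- list of (n_terminal_side, n_spans, score) tuples
    Returns None when no candidate is consistent with the experiment.
    """
    consistent = [
        c for c in candidates
        if c_terminal_side(c[0], c[1]) == observed_c_terminal_side
    ]
    return max(consistent, key=lambda c: c[2]) if consistent else None

# The top-scoring prediction places the C-terminus on the wrong side.
candidates = [("out", 12, 0.70), ("in", 12, 0.61), ("out", 11, 0.58)]
print(best_constrained_topology(candidates, "in"))
```

Here the unconstrained best guess is rejected, and the experimental reference point selects the second-ranked model instead.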

Lars Arvestad
Last modified: Fri May 27 15:09:59 CEST 2005