SBC seminars 2005

Wed Jan 26 15.15 Isaac Elias, SBC
A 1.375-Approximation Algorithm for Sorting by Transpositions
Sorting permutations by transpositions is an important problem in genome rearrangements. A transposition is a rearrangement operation in which a segment is cut out of the permutation and pasted in a different location. The complexity of this problem is still open and it has been a ten-year-old open problem to improve the best known 1.5-approximation algorithm. We provide a 1.375-approximation algorithm for sorting by transpositions. The algorithm is based on new results regarding the diameter of three subsets of the symmetric group: We determine the exact transposition diameter of 2-permutations and simple permutations, and find an upper bound for the diameter of 3-permutations.

Joint work with Tzvika Hartman, Dept. of Molecular Genetics, Weizmann Institute of Science
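
The transposition operation described in the abstract can be illustrated with a small sketch. It is not the 1.375-approximation algorithm from the talk: it only shows the cut-and-paste operation and, for very short permutations, finds the exact transposition distance by brute-force breadth-first search. The function names and the 0-based index convention (exchanging the adjacent blocks perm[i:j] and perm[j:k]) are assumptions made for this example.

```python
from collections import deque
from itertools import combinations


def transpose(perm, i, j, k):
    """Cut the segment perm[i:j] and paste it after perm[j:k].

    Equivalently, exchange the two adjacent blocks perm[i:j] and perm[j:k].
    Indices are 0-based and must satisfy 0 <= i < j < k <= len(perm).
    """
    if not 0 <= i < j < k <= len(perm):
        raise ValueError("need 0 <= i < j < k <= len(perm)")
    return perm[:i] + perm[j:k] + perm[i:j] + perm[k:]


def transposition_distance(perm):
    """Exact minimum number of transpositions needed to sort perm.

    Brute-force breadth-first search over all permutations; only
    feasible for short permutations (illustration only).
    """
    start = tuple(perm)
    target = tuple(sorted(start))
    if start == target:
        return 0
    cuts = list(combinations(range(len(start) + 1), 3))  # all i < j < k
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        current, dist = queue.popleft()
        for i, j, k in cuts:
            nxt = transpose(current, i, j, k)
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise RuntimeError("unreachable: every permutation can be sorted")
```

For example, transpose((1, 4, 5, 2, 3, 6), 1, 3, 5) moves the segment 4 5 behind 2 3 and sorts the permutation in a single step. The search space grows factorially with the permutation length, which is why approximation algorithms such as the one presented in the talk are of interest.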

Wed Feb 16 15.15 Åsa Björklund, SBC
Multi-domain proteins in the three kingdoms of life
Comparative studies of the proteomes from different organisms have provided valuable information about protein domain distribution in the kingdoms of life. Earlier studies have been limited by the fact that only about 50% of the proteomes could be matched to a domain. We have extended these studies by including less well-defined domain definitions, Pfam-B and clustered domains, in addition to Pfam-A and SCOP. It was found that a significant fraction of these domain families are homologous to Pfam-A or SCOP domains. Further, we have shown that all regions that do not match a Pfam-A or SCOP domain contain a significantly higher fraction of disordered structure. These unstructured regions may be contained within orphan domains or function as linkers between structured domains. Using several different definitions we have re-estimated the number of multi-domain proteins in different organisms and found that several methods all predict that eukaryotes have about 65% multi-domain proteins, while the prokaryotes consist of around 40% multi-domain proteins. In conclusion, all eukaryotes have similar fractions of multi-domain proteins and disorder whereas a high fraction of repeats is distinguished only in multicellular eukaryotes. This implies a role for repeats in cell-cell contacts while the other two features are important for intracellular functions.
Wed Feb 23 15.15 Mathias Uhlén, KTH
The Swedish Human Proteome Resource
Status and recent progress in HPR
Wed Mar 2 15.15 Håkan Viklund, SBC
Wed Mar 9 15.15 Erik Sonnhammer, CGB
New algorithms for sequence distance estimation and for HMM searching
Wed Mar 16 15.15 Diana Ekman, SBC
Evolution of multi-domain proteins
Most eukaryotic proteins are multi-domain proteins, i.e. consist of more than one protein domain. We have studied the events that have built these multi-domain proteins. Proteins have been compared based on their domain architectures and a "domain distance" calculated for each pair of proteins. The insertions, repetitions and exchanges of domains that distinguish a protein from its nearest neighbors have been counted and it was found that insertions are somewhat more common than repetitions and that exchanges are not very frequent. Further, insertions and repetitions are approximately equally common at the N- and C-terminals while exchanges have been found more often at the C-terminal. In addition we show that domain distance correlates well with sequence similarity and semantic similarity, based on GO-annotations, and we use this measure to build evolutionary trees. Domain evolution is then exemplified with two modular protein families, non-receptor tyrosine kinases and the Rho GEFs, and a pattern of repetition.
Wed Mar 23 15.15 Erik Granseth, SBC
Halftime seminar: Characterization of the membrane-water interface region of membrane proteins and determination of the membrane proteome of E. coli
The number of membrane proteins with known structure grows exponentially. Today, there are approximately 100 membrane proteins deposited in PDB, which was the number of soluble proteins available in 1974. In my half-time seminar, I will first present a study of the membrane-water interface region. Most statistical studies have focused on the membrane region, but this study moves a few ångström away from the membrane and discusses the structural constraints imposed on the transmembrane helices. The second part of the presentation will be about global topology analysis of the E. coli membrane proteome. By experimentally finding out whether the C-terminal of a membrane protein is located in the cytoplasm or periplasm, high quality topology models for 601 membrane proteins have been produced, which is important for future functional studies.
Fri Mar 30 10.00 Johannes Frey-Skött
Analysis of alternative splicing
Halftime seminar
Wed Apr 6 15.15 Tomas Ohlson, SBC
Halftime seminar
Wed Apr 20 15.15 Marie Öhman, MolBio/SU
An approach to find novel sites of mRNA editing
The ADAR enzymes (1 and 2) catalyze the conversion of adenosine (A) into inosine (I) in RNA by a hydrolytic deamination. This A to I editing acts on RNA that is double stranded, without a defined consensus recognition sequence. Site selective editing has mainly been found in the pre-mRNA of genes involved in neurotransmission in the mammalian brain. Apart from generation of multiple protein isoforms by codon changes, RNA editing plays important roles in regulating other RNA processing events like splicing. We have developed a method that can be used on various tissues as well as species to detect novel single sites of A to I editing. The method is based on an immunoprecipitation assay followed by analyses on microarray. RNA substrates subjected to site selective editing are retrieved from RNA-protein complexes using ADAR2 antibodies. Combined with computational analysis, we expect to find novel sites of editing that have been overlooked by other experimental methods and computational analyses. Using this approach it is possible, in a unique way, to discover single sites of selective A to I editing.
Mon Apr 25 15.00 Karin Melén
Halftime seminar: Increasing the accuracy of membrane protein topology prediction: Application to whole proteomes of E. coli and S. cerevisiae.
In an ideal world the structures of all proteins in every organism would be solved and the functions of all proteins would be identified. If we knew the structures of membrane proteins, drugs would be more easily developed and mankind would hopefully be happier. We are not there yet, but on our way efforts are made not only to solve protein structures but also to gain insights about structure and function by other means. Here we are trying to increase the knowledge about structural features of membrane proteins by improving TMHMM, a well-known method for membrane topology prediction, and applying the refined method to whole proteome studies. I will present TMHMMfix, which enables incorporation of experimental information into the predictions, something that has turned out to improve the accuracy significantly. I will also show how we have used TMHMMfix to map the topology of nearly all membrane proteins in E. coli and S. cerevisiae.
Wed Apr 27 15.15 Tomas Bergström, LCB, Uppsala
Genome Divergence between Humans and Chimpanzees: A story about substitutions, indels and alternative splicing
The high quality sequence of the chimpanzee chromosome 22 has been compared to the orthologous human chromosome 21. The relative contribution of substitutions and insertions/deletions (indels) was analyzed for the 33 Mbp alignment. Particular focus was placed on indels in coding regions and their effect on alternatively spliced transcripts.
Wed May 11 15.15 in FA31 Sergei Maslov, Brookhaven National Laboratory
Detecting topological patterns in protein networks.
Bio-molecular networks lack a top-down design. Instead, selective forces of biological evolution shape them from raw material provided by random events such as gene duplications and single gene mutations. As a result, individual connections in these networks are characterized by a large degree of randomness. One may wonder which connectivity patterns are indeed random, and which arose from the network's growth, evolution, and/or its fundamental design principles and limitations. Here we introduce a general method [1,2] allowing one to construct a random version of a given network while preserving the desired set of its low-level topological features, such as the number of neighbors of individual nodes, the average level of modularity, numbers of small network motifs, etc. Such a null-model network can then be used to detect and quantify non-random topological patterns. In particular, we measure correlations between numbers of neighbors of interacting nodes in protein binding and regulatory networks in yeast [1]. It was found that in both these networks, links between highly connected proteins are systematically suppressed. We proceed by presenting a set of empirical findings about how gene duplications shape protein interaction and genetic regulatory networks in several organisms [3]. It is shown that molecular networks in yeast combine the plasticity of regulatory connections with a relative stability of protein functions manifested in the set of their binding partners. We believe this to be a general feature affecting the evolvability of bio-molecular networks.
  1. S. Maslov and K. Sneppen, Specificity and Stability in Topology of Protein Networks, Science 296, 910-913, (2002).
  2. S. Maslov, K. Sneppen, and A. Zaliznyak, Pattern Detection in Complex Networks: Correlation Profile of the Internet, Preprint at, (2002); Physica A 333, 529-540 (2004).
  3. S. Maslov, K. Sneppen, and K. Eriksen, Upstream Plasticity and Downstream Robustness in Evolution of Molecular Networks. Preprint in (2003); BMC Evolutionary Biology, 4:9, pp. 1-12 (2004).
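
A null-model network that preserves the number of neighbors of every node, as described in the abstract, can be sketched with repeated edge swaps: pick two links A-B and C-D at random and replace them with A-D and C-B, unless that would create a self-loop or a duplicate link. This is a minimal illustration for an undirected simple graph, not necessarily the exact procedure used in [1,2]; the function names, the attempt limit and the fixed random seed are assumptions made for this example.

```python
import random
from collections import Counter


def degrees(edges):
    """Count the number of neighbors of each node in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def rewire_preserving_degrees(edges, n_swaps, seed=0):
    """Return a randomized copy of edges in which every node keeps its degree.

    Repeatedly swaps the endpoints of two randomly chosen edges, rejecting
    swaps that would create a self-loop or a duplicate edge.
    """
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    done = 0
    attempts = 0
    max_attempts = 100 * n_swaps  # guard against graphs with no valid swap
    while done < n_swaps and attempts < max_attempts:
        attempts += 1
        x, y = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[x]
        c, d = edge_list[y]
        if rng.random() < 0.5:  # undirected: try both orientations
            c, d = d, c
        if len({a, b, c, d}) < 4:
            continue  # shared node: swap would create a self-loop or change nothing
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue  # swap would create a duplicate edge
        present.discard(frozenset((a, b)))
        present.discard(frozenset((c, d)))
        present.update((new1, new2))
        edge_list[x] = (a, d)
        edge_list[y] = (c, b)
        done += 1
    return edge_list
```

Comparing a statistic of the real network, such as how often highly connected nodes are linked to each other, with its distribution over many such rewired copies indicates whether the observed pattern is more or less frequent than expected from the degrees alone.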
Wed May 11 16.00 at DBB Björn Wallner, SBC
Pre-dissertation seminar
Please note the time, date, and place!
Spring Siv Andersson, Uppsala
Computational inference of scenarios for alpha-proteobacterial genome evolution
This seminar is postponed!
Wed May 18 15.15 Lars Arvestad, SBC
Querying PubMed and managing citations from the command line.
I will describe and demo my system for interacting with PubMed from the command line. This started off as a simple way of extracting BibTeX citations from PubMed, but is now a simple yet powerful tool for submitting searches, archiving selections of articles, and accessing information.
Wed May 25 15.15 Thomas Bürglin, Dept. Bioscience and CGB, KI
Solving C. elegans developmental biology: experimental approaches and development of computational tools
Sat-Sun May 28-29 SBC Workshop
Comparative Genomics and Protein Structure
Tue May 31 15.15 Henrik Kaessmann, Center for Integrative Genomics, Lausanne
Wed Aug 31 13.00 Kenta Nakai
Searching for sequence determinants of the translation efficiency of proteins in a cell-free system
Wed Sept 28 15.15 Olof Emanuelsson, SBC
Genomic tiling microarrays and the ENCODE project
Genomic tiling microarrays have recently become a popular platform for interrogating the activity of large genomic regions in an unbiased fashion. I will introduce some key concepts of tiling microarrays, along with some recent applications within the human genome world. The focus will mainly be on how to use different tiling microarray strategies for transcription mapping. I will also give an introduction to the world-wide ENCODE consortium.
Wed Oct 5 15.15 Lukasz Huminiecki, KI
Evolution of Expression Pattern Diversity: a combined computational and experimental approach
To examine the process by which duplicated genes diverge in expression, we studied how transcriptional profiles of orthologous gene sets in human and mouse were affected by the presence of additional recent species-specific paralogs. Gene expression profiles were compared across 16 homologous tissues in human and mouse using microarray data from the Gene Expression Atlas 1 (integrated with LocusLink and Ensembl) for 1575 sets of orthologs, including 250 with species-specific paralogs. Orthologs that have undergone recent duplication were less likely to have strongly correlated expression profiles than those that remained in a one-to-one relationship. Our results suggest that gene expression profiles are surprisingly labile, especially in lineages where a duplication event has occurred, and that transcription in a particular tissue may be repeatedly gained or lost during the evolution of even small gene families [Huminiecki and Wolfe, 2004]. Other researchers have also noted that orthologs may be poorly correlated in their expression profiles [Jordan IK et al., 2005; Khaitovich et al., 2004]. However, it is still difficult to resolve how much of the variability reflects real biology, and how much could be attributed to differences in sample ontology, the extraction procedure, RNA isolation, microarray setup, cross-hybridization, or bioinformatics. For example, we have previously shown that internal consistency of many publicly available expression datasets is rather low, and that expression profiles for the same gene derived from different experimental platforms (such as SAGE, ESTs or microarrays) do not in general correlate well [Huminiecki et al. 2003; Huminiecki and Bicknell, 2000]. In collaboration and with generous support from Pfizer UK (Sandwich, Kent), we are currently generating a qPCR/TaqMan-based dataset of expression profiles for a number of G-protein coupled receptors (GPCRs) of pharmaceutical interest.
We are focusing on peripherally expressed type-A GPCRs in human, mouse, rat, guinea pig and dog. Quantitative PCR is more specific, sensitive and precise than microarray platforms. Thus, we hope that this novel approach will help to establish the true extent of conservation of gene expression patterns in placental mammals. In addition, we will strive to gain a deeper understanding of the relevance and suitability of animal species used for functional efficacy and toxicological studies at Pfizer.
  • Jordan IK, Marino-Ramirez L, Koonin EV. Evolutionary significance of gene expression divergence. Gene. 2005 Jan 17;345(1):119-26.
  • Khaitovich P, et al. A neutral model of transcriptome evolution. PLoS Biol, 2004. 2(5): p. E132.
  • Huminiecki L and Bicknell R. In silico cloning of novel endothelial-specific genes. Genome Res. 2000 Nov; 10(11): 1796-806.
  • Huminiecki L, AT Lloyd and KH Wolfe. Congruence of tissue expression profiles from Gene Expression Atlas, SAGEmap and TissueInfo databases. BMC Genomics, 2003. 4(1): p. 31.
  • Huminiecki L and Wolfe KH. Divergence of spatial gene expression profiles following species-specific gene duplications in human and mouse. Genome Res, 2004. 14(10A): p. 1870-9.
Mon Oct 10 15.30 Tomas Ohlson
Improving protein sequence alignments using evolutionary information and machine learning techniques
Pre-dissertation seminar. Location: Magnelisalen, Stockholm University

Producing a high-quality alignment may be the most important step in protein structure prediction using homology modeling. For closely related sequences one can use PSI-BLAST to produce satisfactory alignments, but when the sequences are more distantly related PSI-BLAST cannot produce good alignments. Instead, profile-profile alignment methods can be used. I will present a benchmarking study of profile-profile alignment methods and also show how the alignments can be further improved using machine learning techniques.

Wed Oct 12 15.15 Andrey Alexeyenko, in room FA32, Albanova
FunCoup: A multi-faceted predictor of functional links between genes of eukaryotic organisms
The FunCoup method aims to discover novel functional links between proteins (genes). FunCoup uses Bayesian networks optimized with multivariate techniques to integrate genomics and proteomics information of various types and from different sources. The crucial novelty is the use of data on orthologous genes in well-studied model organisms, which often gives strong additional evidence for functional links. The procedure of finding and treating orthologs employs the InParanoid method and thus is optimized for eukaryotic genomes. FunCoup uses a range of data sources, from loose associations like co-expression to physically interacting proteins and phyletic profiles. While finding links, FunCoup incorporates available data for such model organisms as mouse, rat, D. melanogaster, C. elegans, and yeast. Phyletic profiles are also built with a number of less studied genomes. A key feature of the Bayesian approach is that each data source and organism is weighted by its reliability and relevance.

The quality of the predictions has been cross-validated and assessed on test sets. During the testing (assisted with ANOVA techniques) we found that no previously known particular innovation in the field can be trusted without multiple sampling from various genomes and functional classes.

The output of FunCoup is the likelihood that two proteins are functionally coupled (compared to the expected background probability for random protein pairs). Found links can be used for focussing work on individual proteins of interest or creating sophisticated gene networks.

Wed Oct 26 15.15 Alexander Schliep, Max Planck Inst. for Mol. Genetics, Berlin
In FA32, Albanova
Analyzing ArrayCGH Data using HMMs with Non-homogenous Markov Chains
Comparative genomic hybridization studies using DNA microarrays (ArrayCGH) can elucidate copy number changes due to diseases or genetic reasons. If BAC clones are used as probes, the positional proximity of probes on the chromosome begins to show a pronounced effect. A natural model is to consider the differential hybridization of probes along the chromosome as a sequence of observations in which the correlation between subsequent positions depends on their distance and overlap. Prior work on analyzing gene expression in the presence of chromosomal aberrations focused on more classical statistical approaches, neglecting proximity effects. We present the first approach to model proximity effects explicitly using Hidden Markov Models with an underlying time-inhomogeneous Markov chain. Thus we develop a more realistic model of the events influencing the change of gene expression levels over regions. We will introduce the basic approach, the necessary extensions to the HMM framework and an argument against the use of segmentation approaches.
Wed Nov 2 15.15 Abhiman Saraswathi
Analysis and prediction of functional shifts in protein families
Gene duplications are an important phenomenon, whereby genes in an organism could acquire subfunctionalisation or neofunctionalisation. Presently, groups of proteins are clustered into families based on sequence similarities and have one or more general biochemical functions in common. It is also known that different subgroups within these families have evolved slightly different functions, such as different substrate specificities, activities and mechanisms. It is important to detect such functional differences between members of a protein family for a more accurate annotation of function. Novel measures developed by us for the prediction of functional shifts between protein subfamilies will be presented. These new measures were able to discriminate between subfamily pairs with the same enzyme function and subfamily pairs with different enzyme functions. We show that the discrimination is preserved irrespective of the methods used and also improves for larger subfamilies. Moreover, we combine the proposed measures to increase the overall prediction power. FunShift, a database of function shift analysis on protein subfamilies, will also be presented. The database can be accessed at
Wed Nov 9 13.00 Samuel Andersson
The Motif Yggdrasil sampler: A tree-based Gibbs sampler for detection of transcription factor binding sites.
Please note the time change to accommodate Samuel's teaching.

In phylogenetic footprinting, putative regulatory elements are found in upstream regions of orthologous genes by searching for common motifs. Motifs in different upstream sequences are subject to mutations along the edges of the corresponding phylogenetic tree, and taking advantage of the tree in the motif search is an appealing idea. We describe the Motif Yggdrasil sampler, the first Gibbs sampler based on a general tree that uses unaligned sequences. Previous tree-based Gibbs samplers have unrealistically assumed a star-shaped tree or partially aligned upstream regions. We give a probabilistic model describing upstream sequences with regulatory elements and build a Gibbs sampler with respect to this model. We apply the collapsing technique to eliminate the need to sample nuisance parameters, and give a derivation of the predictive update formula. The use of the tree achieves a substantial increase in nucleotide level correlation coefficient both for synthetic and biological data.

Wed Nov 16 15.15 TBA
Wed Nov 23 15.00 Håkan Viklund
FA31 Transmembrane proteins, from sequence to structure
In the popular formulation of the protein folding problem, the goal is to find an algorithm that can predict the three dimensional structure of a protein given its amino acid sequence. And for the last 30 years, this has remained one of the most basic unsolved problems in bioinformatics. This talk will deal with some aspects regarding this problem in the context of transmembrane proteins. Specifically, the fields of topology prediction and topology modeling and their possible role in contributing to solving the complete folding problem will be discussed.
Wed Nov 30 15.15 Claes Malmnäs
FA32 Models of TF-DNA interactions in S. cerevisiae
In order to be able to carry out their function properly, transcription factors (TFs) must have a high equilibrium binding probability to their targets. Moreover, the TFs need to find these targets in a reasonable time. The search involves a combination of 1D and 3D diffusion. During 1D diffusion, the TF is bound to the DNA - either specifically or non-specifically - but is able to slide along it. The need for short search time and high binding probability of targets imposes constraints on the TF-DNA interactions. We test two competing models of TF-DNA interactions by using a combination of experimental data on the number of TFs of different kinds present in an S. cerevisiae cell and theoretical estimates of binding energies of TF-DNA interactions.

The work presented has been carried out in collaboration with Erik Aurell and Aymeric Fouquier d'Herouel, KTH and Massimo Vergassola, Pasteur Institute.

Mon-Tue Dec 5-6 Special Event
17th Bi-annual Stockholm-Copenhagen Bioinformatics Meeting
Place: Geovetenskapliga byggnaden, Frescati.
Wed Dec 7 15.15 Mikael Oliveberg
FA31 Protein Folding, Misfolding and Neurodegenerative Disease: How proteins maintain their right shapes and what happens when they don't. (Tentative title)
Proteins control our lives down to the smallest detail. Even so, the question of how a protein is formed is one of life's great mysteries. In a split second, the floppy protein chain forms itself into a ball with a unique shape and function. Occasionally, however, proteins get trapped in a wrong shape and run amok with devastating consequences for the cells. Understanding these protein folding and misfolding processes is critical for finding rational treatments for many debilitating conditions like Alzheimer's disease, ALS and the prion diseases. Basically, the underlying principle is simple: the tension between fat and water. Just as fat is attracted to fat, and water to water, the proteins are controlled in the cells and join together correctly to form the right shape all by themselves. If one part loosens, it is automatically pulled back. This remarkable ability to self-assemble is at least partly orchestrated by amino acids that sit like guards making sure that no wrong knots are made. If you remove them, the proteins distort, stick together by exposure of their greasy interior and kill the cells. Suddenly, the uniting force has been turned against us. But the most fascinating thing about proteins is not that they can go wrong, but that they function at all. What stops chaos from taking over?
Wed Dec 14 16.00 Sara Light
Pre-dissertation seminar