SBC seminars 2007

Wed Jan 3 11:00 David Gloriam, EBI
162:022A (a.k.a. "coffee and meeting room", the SBC house)
In Silico analyses of G protein-coupled receptors and standards for antibody and protein array databases and data

The first and main part of the talk concerns bioinformatic studies of G protein-coupled receptors (GPCRs) performed during a PhD at the Department of Neuroscience, Uppsala University. It includes results from searches for new human receptor sequences, analyses of the human, mouse and rat GPCRomes and a new classification system for GPCRs. Experiences from genome and protein database mining, phylogenetic analysis, sequence motifs and the usage of expressed sequence tags (ESTs) for gene sequence curation and derivation of preliminary expression profiles will also be discussed.

The second part will briefly describe the development of (XML) standards for antibody and protein array databases and database exchange. Standards are currently being implemented, in a postdoctoral project at the European Bioinformatics Institute, by extending the functionality of the IntAct database and HUPO's standard for exchange of molecular interaction data. The effort is part of a European consortium that proposes to establish a resource of binding molecules against the entire human proteome.

Wed Jan 10 15:15 Jens Lagergren, SBC/CSC
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
Fast Neighbor Joining
Neighbor Joining (NJ) is a very well-known distance method for phylogenetic tree reconstruction. It was described by Saitou and Nei in 1987. More recently there have been a number of results attempting to explain why it performs so well or to improve its running time. Isaac Elias and I have given a significantly faster algorithm, FNJ, with basically the same accuracy as NJ, and have also simplified some of the explanatory results. I will give an elementary explanation of these results and also explain why it all boils down to the seemingly trivial task of counting the number of 1's in a computer register.
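Counting the 1's in a register is commonly called a population count. The sketch below illustrates that operation only; the abstract does not say how FNJ uses it, so nothing here should be read as part of the FNJ algorithm. The function name `popcount` is chosen for illustration.

```python
def popcount(x: int) -> int:
    """Count the 1-bits in a non-negative integer (Kernighan's method)."""
    if x < 0:
        raise ValueError("popcount is defined here for non-negative integers only")
    count = 0
    while x:
        x &= x - 1  # clears the lowest set bit
        count += 1
    return count


if __name__ == "__main__":
    # 0b1011 has three bits set.
    print(popcount(0b1011))
```

The loop runs once per set bit rather than once per bit position. Recent Python versions (3.10 and later) also offer `int.bit_count()` for the same purpose.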
Mon Jan 15 15:00 Lior Pachter, University of California at Berkeley
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
Why neighbor joining works
We show that the neighbor-joining algorithm is a robust quartet method for constructing trees from distances. This leads to a new performance guarantee that contains Atteson's optimal radius bound as a special case and explains many cases where neighbor-joining is successful even when Atteson's criterion is not satisfied. We also provide a proof for Atteson's conjecture on the optimal edge radius of the neighbor joining algorithm. The strong performance guarantees we provide also hold for the quadratic time fast neighbor-joining algorithm, thus providing a theoretical basis for inferring very large phylogenies with neighbor-joining.
Thu Jan 18 14:00 Rickard Sandberg, MIT
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
Using exon arrays to study the transcriptional and post-transcriptional regulation during T-cell activation: preliminary results

Post-transcriptional control of eukaryotic gene expression is more general and elaborate than previously thought. Analysis of sequence data and splicing-sensitive arrays suggests that alternative splicing regulates half of all human protein-encoding genes, and numerous studies have shown that different transcript isoforms (splice variants) have different, sometimes antagonistic, functions. For example, Fas exon 6 can be included or skipped to generate mRNAs encoding, respectively, a membrane-bound form of the receptor that promotes apoptosis or a soluble isoform that prevents programmed cell death (Cascino, Fiucci et al. 1995).

We have used Affymetrix exon arrays to study transcriptional and post-transcriptional responses that occur in primary mouse T-cells after activation. The exon arrays (promise to) enable unbiased detection of the expression level of individual exons and may therefore be used to identify alternatively spliced exons and alternative promoters in addition to gene expression levels. To this end, we developed a novel algorithm to identify alternatively regulated exons, which enabled us to characterize both the post-transcriptional and transcriptional regulation. Our algorithm was validated by real-time PCR analysis of a set of 16 exons, inferred from the exon array data, which included "skipped exons" and shifts in promoter and 3'UTR usage. In addition, we are in the process of determining the conservation pattern of the post-transcriptional events by analyzing exon array data from activated and resting human cord blood-derived T-cells. Finally, we would like to use this model system to investigate the coupling between different modes of gene regulation (e.g. miRNAs and alternative 3'UTR usage) as well as the functional importance of specific isoforms.

Wed Jan 24 13:15 Rob Finn, The Wellcome Sanger Institute, Hinxton, UK
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
Understanding Relationships and Interactions Between Pfam Families

Since its conception a little over 10 years ago, the primary goal of the Protein FAMily database (Pfam) has been the classification of protein domains into distinct families. Over the past two years Pfam has increased dramatically in size, both in terms of the number of families and also the breadth of information available for each family. Our goal is to expand the data within Pfam so as to allow it to be understood in a truly biological context, rather than as simply a collection of families.

As part of this work we have started to classify Pfam entries hierarchically, gathering them, where possible, into groups of related families termed Clans. In this talk I will outline the tools we employ to identify distantly related Pfam families and discuss in detail one such tool, known as SCOOP (Simple Comparison Of OutPuts), which was developed within the group. SCOOP takes the outputs from a set of profile HMM searches and looks for sequences that match multiple HMMs, scoring each match according to the likelihood that it may be expected to occur purely by chance. A higher score, indicating that a given match is unlikely to have arisen by chance alone, is taken to indicate a potential distant similarity. Having identified these distant relationships we are now able to use our understanding of well categorised families to improve our understanding of other, less well characterised ones.

As well as continuing to improve our techniques for annotating domains, we have also begun to increase the number of annotations in Pfam which are based on sequence features, using a number of different approaches. The first approach, which will be described here, relies on the development of a methodology for the transfer of experimentally defined active sites between sequences within Pfam. The second approach that I will mention uses the Distributed Annotation System (DAS) to retrieve data from disparate resources around the world and integrate them in a single visualisation. We hope to build on these and other data-mining techniques to improve and increase the number of annotations across the Pfam data.

Finally, I will touch on a completely different area of work within the group, which focusses on identifying domain-domain interactions using known three-dimensional structures obtained from the Protein DataBank (PDB). These data are contained in a subsection of Pfam termed iPfam (UK site only). I will describe some of the applications of the iPfam data and describe how we hope to use them to improve our understanding of protein networks at the domain level, as well as in terms of disease-causing mutations.

Wed Jan 31 13:15 Aymeric Fouquier d'Hérouël, KTH (Aurell group)
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
A glimpse into the NC world of Enterococcus faecalis
Nosocomial infections are often due to microbial strains with high resistance to anti-microbial agents, emerging from the artificial selection pressure in hospital environments. A commonly involved bacterium is E. faecalis, which causes a variety of infections in the urinary tract as well as life-threatening endocarditis. In addition to clinically induced resistance, the natural resistance of E. faecalis to common antibiotics makes it hard to treat.

In this talk I present an ongoing bioinformatic and experimental approach to identify putative non-coding RNA genes as well as their targets in the pathogen's genome, possibly yielding new ways of treatment. A brief overview of the main ncRNA mechanisms is also given.
Wed Feb 14 13:15 Diana Ekman, SBC/CBR
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
Evolution and function of multidomain proteins
Proteins are composed of domains, recurrent protein fragments with distinct structure, function and evolutionary history. Novel proteins can therefore be created by combining domains into new organizations. Although most domains can exist as single-domain proteins, a majority of them are also combined with other domains. The rearrangement processes whereby novel multidomain proteins are formed have been the main focus of our studies.

Using both structural and evolutionary domain definitions we estimated the number of multidomain proteins in different organisms and found that approximately 65% of eukaryotic proteins are multidomain proteins, compared with approximately 40% of prokaryotic proteins. However, these numbers are strongly dependent on the exact choice of cutoff for domains in unassigned regions.

Next, we determined that the predominant force in the creation of novel multidomain proteins is insertion (or deletion) of a single domain at either the N- or C-terminus. However, a notable exception was found for repeating domains, which are often created from duplication of several domains at a time, and the duplications often occur in the middle of the repeats. Further, we studied the timing of these evolutionary events by mapping domain combinations onto an evolutionary tree. This showed that most domains evolved early in evolution, whereas multicellular organisms evolve mainly through domain rearrangements and less through invention of novel domains. We also estimated that approximately one new domain combination has been created per million years. Finally, we found that domain content is important for the interaction properties of proteins. Proteins with many interaction partners often have multiple domains, and in particular, domain repeats.
Wed Feb 21 13:15 Olof Karlberg, SBC
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
Improving interolog identification
Although the amount of available data is increasing and there are close to 100,000 protein interactions reported in human, most of the human interactome is likely to be uncharacterized. A popular approach to compensate for the missing data is to infer protein interactions from interacting orthologs in model organisms, so-called interologs.

The underlying theory is that orthologs have retained their function and hence also their interaction partners during evolution. This has also been supported in practice, as various measures show that interaction data derived from model organisms is indeed enriched in true interactions. However, gene family expansions will cause a one-to-many relation between orthologs, and if all orthologs are treated the same, the interolog network will be inflated. This is even more problematic as the experimental interaction data from which the inferences are made is estimated to contain up to 50% false positives.

I have compared different strategies for interolog assignments based on both the popular sequence similarity based Inparanoid program and orthologs from the TreeFam database identified by phylogenetic methods. Preliminary results show that the use of TreeFam can increase the specificity but at the cost of greatly reduced sensitivity.
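The inflation caused by one-to-many ortholog relations can be sketched in a few lines. This is a toy illustration, not the procedure used in the talk: the function name, the gene identifiers and the plain dictionary standing in for an ortholog assignment (from Inparanoid, TreeFam or any other source) are all assumptions made for the example.

```python
from itertools import product


def infer_interologs(model_interactions, orthologs):
    """Transfer interactions from a model organism to human.

    model_interactions: iterable of (gene_a, gene_b) pairs in the model organism.
    orthologs: dict mapping a model-organism gene to a set of human orthologs.
    Returns a set of unordered human gene pairs (frozensets).
    """
    inferred = set()
    for a, b in model_interactions:
        # Every combination of orthologs is transferred, so a one-to-many
        # relation multiplies the number of inferred interactions.
        for human_a, human_b in product(orthologs.get(a, ()), orthologs.get(b, ())):
            if human_a != human_b:
                inferred.add(frozenset((human_a, human_b)))
    return inferred


if __name__ == "__main__":
    # Hypothetical identifiers: one yeast interaction, one expanded gene family.
    interactions = [("y1", "y2")]
    ortholog_map = {"y1": {"H1"}, "y2": {"H2a", "H2b"}}
    for pair in sorted(sorted(p) for p in infer_interologs(interactions, ortholog_map)):
        print(pair)
```

Here a single model-organism interaction yields two inferred human interactions, which is the inflation described above; a stricter strategy would keep only a subset of the ortholog combinations.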

Wed Feb 28 13:15 Lars Arvestad, SBC/CSC
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
A phylogenetic approach to defining gene families

Identifying gene families is an old problem that many researchers have looked at and there are probably as many solutions as there are researchers. Now there is one more. I will describe some recent work by myself and Jens Lagergren that includes a formal definition of the term gene family and some practical results.

Wed Mar 7 13:15 Johan Grahnen, SBC/CBR
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
Detecting errors in experimentally determined transmembrane protein structures using structure prediction methods
Transmembrane (TM) proteins are known to be both vitally important to the cellular machinery and notoriously difficult to crystallize, which makes determining their structure through X-ray crystallography both important and extremely challenging. Even when crystallization is possible one is not guaranteed a physiologically relevant structure, sometimes due to crystallographic artifacts and sometimes due to minute but fatal errors in data processing (see Chang et al., "Retraction", Science vol 314, p 1875). If there were some way of separating the "good" structures from the "bad" ones, or at least raising a "warning flag" when something very unusual crops up, many mistakes could be avoided. I present some preliminary results which indicate that comparing the results from structure prediction methods with the proposed experimental structure is one way of detecting these discrepancies.
Wed Mar 14 13:15 Per Larsson, SBC/CBR
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
Accurate homology modeling of proteins by segment matching

By using a database of highly refined protein X-ray structures, it is possible to build, to a high degree of accuracy, the structure of any protein given only some of its atoms. The database is broken into a set of short segments, which are then fitted onto the framework of the target structure. In the process, three criteria are used for selecting good segment matches: amino acid residue similarity, conformational similarity (rms deviation) and segment compatibility with the target structure (van der Waals interactions).

For a test set of proteins that have between 46 and 323 residues, the all-atom rms deviation of the modeled structures is between 0.93 and 1.73 Å, values comparable with differences in refining of X-ray structures.

Also, this approach is very suitable for building models using multiple templates. Using multiple template sequences derived from consensus modelling approaches, it is possible to improve upon the single best template in a majority of cases.
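The rms deviation used above, both as a matching criterion and as the accuracy measure, can be written out briefly. The sketch assumes the two coordinate sets list the same atoms in the same order and have already been superimposed; the function name `rmsd` and the input format (sequences of x, y, z tuples in Å) are illustrative choices, not taken from the talk.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


if __name__ == "__main__":
    model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
    print(round(rmsd(model, reference), 3))
```

No optimal superposition is performed here; in practice the structures would first be aligned, which is a separate step.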

Wed Mar 21 13:15 Mats Lindskog, SBC
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
Identification of novel cancer candidate genes via functional linkage
A human interactome has previously been constructed by using a Naïve Bayesian Network approach with a large number of data sources (Alexeyenko et al., manuscript in preparation). Evidence from seven different species (human, mouse, rat, fly, worm, yeast, and thale cress) and experimental setups such as co-expression, protein-protein interaction, subcellular co-localization, and phylogenetic profiles have been used. This interactome has been screened with known cancer genes in order to identify novel linkages to genes not previously associated with cancer. An analysis pipeline has been implemented that identifies novel cancer candidates and subsequently ranks these candidates according to the number of linkages to known cancer genes.

In order to find more supportive evidence, the HPA (Human Protein Atlas) has been screened for protein expression levels of the identified candidates. For the candidate genes, protein expression levels in 18 different cancers have been compared with those of their normal tissue counterparts.
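The ranking step (ordering candidates by their number of linkages to known cancer genes) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline from the talk: the network is taken to be a list of undirected gene pairs in which each link appears once, and the function name and the example candidate identifiers are hypothetical.

```python
from collections import Counter


def rank_candidates(links, known_cancer_genes):
    """Rank genes not in known_cancer_genes by their number of links to that set.

    links: iterable of (gene_a, gene_b) pairs, each undirected link listed once.
    Returns a list of (gene, link_count), highest count first, ties by name.
    """
    counts = Counter()
    for a, b in links:
        if a in known_cancer_genes and b not in known_cancer_genes:
            counts[b] += 1
        elif b in known_cancer_genes and a not in known_cancer_genes:
            counts[a] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    known = {"TP53", "BRCA1"}
    network = [
        ("TP53", "G1"),
        ("BRCA1", "G1"),
        ("TP53", "G2"),
        ("G2", "G3"),       # link between two candidates: ignored
        ("TP53", "BRCA1"),  # link between two known genes: ignored
    ]
    print(rank_candidates(network, known))
```

A raw link count ignores link confidence; weighting each link by its network score would be a natural variation.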
Wed Apr 18 13:15 Serhiy Souchelnytskyi, Karolinska Biomics Center, Karolinska University Hospital
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
Proteomics data and modeling: search for novel anti-cancer treatments

Proteomics provides a description of cancer-related changes in cells. The richness of these datasets requires tools to unveil systemic properties of carcinogenesis.

We perform proteome profiling of tumor and non-cancerous tissues, and of primary and established human breast epithelial cells. Analysis of the identified cancer-specific proteins showed that multiple regulatory pathways are affected, e.g. metabolism, apoptosis, stress response, serine/threonine and tyrosine kinase signalling. Systemic exploration of the identified proteins indicated that the status of signalling pathways is as important for cancer development as expression of particular oncogenes or tumor suppressors. Moreover, analysis of systemic properties of cell responsiveness to chemotherapeutics provides the basis for individualized treatment of patients.

Thu May 3 11:15 Sepp Hochreiter, Johannes Kepler University Linz
RB15 (Roslagstullsbacken 15, AlbaNova 102:013)
Feature Selection in Bioinformatics
Modern measurement techniques in both biology and medicine create a huge demand for new machine learning approaches, especially for feature selection methods. One such technique is the measurement of mRNA concentrations with microarrays, where the genes of interest must be extracted to make predictions or to identify drug targets. In other examples, patterns in DNA data indicate alternative splicing, nucleosome positions, gene regulation, etc. All of these tasks are performed by machine learning algorithms which filter out relevant components or detect patterns, i.e. interrelations of components. In the biomedical context, feature selection is a preprocessing step to predict cancer treatment outcomes based on gene expression profiles, to diagnose diseases based on peptide arrays, to classify novel protein sequences into structural or functional classes, and to extract new dependencies between DNA markers (SNPs, single nucleotide polymorphisms) and diseases (schizophrenia or alcohol dependence).

Wed May 9 13:15 Björn Andersson, Karolinska Institutet
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
Metagenomic screening for new viruses; Bioinformatics challenges
Virus infections cause many of the largest health problems in the world. It is likely that there is a multitude of unknown viruses that infect humans, and it has been suggested that viruses are involved in causing many common diseases, such as diabetes and MS. The discovery rate of new viruses has until now been slow. I will present initial results from a project to further develop and use a strategy for virus discovery using a genomics/bioinformatics approach. The project has resulted in the development of a pipeline to discover unknown viruses in patient samples and the characterization of several new viruses. The methods include enrichment of virus particles, shotgun sequencing and bioinformatics analyses. The methods have been shown to work efficiently and are ready for scaling up to characterize the human virome. Individual virus discoveries will lead to new clinical insights, therapies and diagnostic tools, and we aim to develop this protocol further with the goal of providing a broader picture of human virus infections in relation to disease. I will present the analysis of the sequence data accumulated thus far, including the characterization of two new human viruses, Human Bocavirus and KI Polyomavirus, and a broader description of the known and novel viruses, bacterial, phage and human sequences and completely unknown sequences found in samples of different clinical origin.

Wed May 23 13:15 Gunnar von Heijne, SBC/CBR
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
Identification and evolution of dual-topology membrane proteins
Membrane proteins commonly evolve in size and complexity by gene duplication. A particularly intriguing mode of gene-duplication-based evolution is when a so-called dual-topology protein undergoes duplication to yield two oppositely orientated, homologous proteins. These two proteins may even fuse into a single molecule with an approximate internal symmetry in its 3D structure. We have focused on the small multidrug-transporter family of proteins, which provides many examples of dual-topology, oppositely orientated, and fused, internally symmetric proteins.

Daley, D.O., Rapp, M., Granseth, E., Melén, K., Drew, D., and von Heijne, G. (2005) Global topology analysis of the Escherichia coli inner membrane proteome. Science 308, 1321-1323.

Rapp, M., Seppälä, S., Granseth, E., and von Heijne, G. (2006) Identification and evolution of dual topology membrane proteins. Nature Struct. Mol. Biol. 13, 112-116.

Rapp, M., Seppälä, S., Granseth, E., and von Heijne, G. (2007) Emulating membrane protein evolution by rational design. Science 315, 1282-1284.

Wed Sep 19 13:15 Erik Sonnhammer, SBC
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
Phobius and its Distributed Annotation Service (DAS)

Phobius has been benchmarked as the most accurate transmembrane topology predictor on single sequences, in particular when signal peptides are treated as unknowns. In the 2004 Phobius publication it was found to be 10-23% more accurate than TMHMM, previously considered the most accurate program. Despite this, the awareness that TMHMM is deprecated and superseded by Phobius is poor. With this talk I hope that at least scientists at SBC will be informed, and hopefully will pass it on to external colleagues.

The novelty of Phobius is that it can predict both signal peptides and transmembrane segments. This allows it to discriminate between these two often confused features, and the accuracy improvement stems from this ability.

To illustrate how important this discrimination is, we carried out the following study: TMHMM and SignalP were applied to five complete proteomes. 30-65% of all SignalP predicted signal peptides and 25-35% of all TMHMM predicted transmembrane topologies overlapped. This casts doubt over the predictions for 5-10% of each proteome, see following article. Phobius resolves these conflicts by making an optimal choice between transmembrane segments and signal peptides. It also allows constrained and homology-enriched predictions.

To make Phobius accessible beyond the web servers, we have also set up a DAS (Distributed Annotation System) server at SBC. This service can be queried with the DAS protocol using any UniProt accession number, and a topology prediction will be returned. I will describe the DAS system briefly and demonstrate the DAS registry, which currently holds 264 services (62 for protein sequence).

Wed Sep 26 13:15 Lorand Levente & Alexey Kutsenko, MTC, Karolinska Institutet
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
Prediction of transcription factor binding sites within the promoter region of one interesting gene. An approach practised by experimental biologists

Expression of the Epstein-Barr virus (EBV) latent membrane protein (LMP1) is regulated by virus- and host cell-specific factors and plays an important role in switching between different types of EBV. The current thinking is that LMP1 is involved in host networks and that the expression of LMP1 is controlled by host cell transcription factors (TFs). Here we present the results of in silico prediction of TF binding sites within the LMP1 promoter region in the EBV sequence.

Wed Oct 3 13:15 Lars Engstrand, SMI, Karolinska Institutet
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
The normal microbiota in health and disease - use of high throughput techniques to reveal the human microbiome

Humans host complex microbial ecosystems, which are postulated to contribute to both health maintenance and the development of chronic inflammatory diseases. Alterations in the human gut microbiota have recently been associated with diseases such as cancer, type II diabetes, inflammatory bowel disorder, allergy and obesity. Determining the microbial composition in patients and healthy controls may provide novel therapeutic targets, but is currently a time-consuming and expensive process. Large-scale studies have therefore not been feasible. However, new high-throughput culture-independent molecular tools are now being developed, allowing the scientific community to characterize and understand the microbial communities underpinning biological processes in unprecedented ways.

Wed Oct 10 13:15 Petter Holme, KTH Computational Biology
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
Structure & function of metabolic networks

Modern technology provides us with large amounts of biological data. This opens possibilities to study the large-scale organization and function of a biological system. The information is, however, typically not comprehensive enough to use the same tools for studying large-scale systems (e.g. the metabolism) as small subsystems (e.g. the citric acid cycle). I will discuss network theory in general, and how statistical graph theory can be used to study the organization of metabolism. Specifically I will talk about the role, definition and detection of "currency metabolites" -- ubiquitous substances (like water, carbon dioxide, ATP, etc.) that occur in a multitude of reactions.
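One naive way to flag candidate currency metabolites is simply to count how many reactions each substance takes part in. The sketch below does only that; it is a degree-based heuristic chosen for illustration, not the definition or detection method presented in the talk. The function names, the `min_fraction` threshold and the toy reaction table are assumptions.

```python
from collections import Counter


def metabolite_degrees(reactions):
    """Count, for each metabolite, the number of reactions it participates in.

    reactions: dict mapping a reaction id to an iterable of metabolite names.
    """
    degrees = Counter()
    for metabolites in reactions.values():
        degrees.update(set(metabolites))  # count each metabolite once per reaction
    return degrees


def candidate_currency_metabolites(reactions, min_fraction=0.5):
    """Return metabolites occurring in at least min_fraction of all reactions."""
    degrees = metabolite_degrees(reactions)
    total = len(reactions)
    return sorted(m for m, d in degrees.items() if d / total >= min_fraction)


if __name__ == "__main__":
    toy_network = {
        "r1": ["glucose", "ATP", "G6P", "ADP"],
        "r2": ["F6P", "ATP", "FBP", "ADP"],
        "r3": ["G6P", "F6P"],
        "r4": ["PEP", "ADP", "pyruvate", "ATP"],
    }
    print(candidate_currency_metabolites(toy_network, min_fraction=0.6))
```

A fixed threshold is arbitrary, which is one reason a more principled, network-based definition is of interest.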

Wed Oct 17 13:15 Kristoffer Forslund, SBC
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
Models for domain functional interplay and Gene Ontology function prediction
The relationship between protein domain content and function is important to investigate, both for understanding domains and to annotate proteins in an automated manner. We present two different models for how protein domains combine to yield specific function; one rule-based, one probabilistic, and demonstrate how these are useful for Gene Ontology annotation transfer. The former is an intuitive generalization of the pfam2go mapping, and detects cases of strict functional implications of sets or motifs of domains. The latter uses a Naive Bayesian network-based model to represent the relationship between domain content and annotation terms, and is found to be better adapted to incomplete training sets. We implement these models as predictors of Gene Ontology annotation terms, and the resulting tools are shown to be more effective than conventional best BLAST-hit annotation transfer on a large-scale dataset. We further present a number of cases where combinations of Pfam-A protein domains can be shown to significantly predict functional terms that do not follow from the individual domains.
Wed Oct 31 13:15 Karin Julenius, SBC
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
Diagnosis of dementia in preclinical phase
The development of a dementia disorder (like Alzheimer's disease or Vascular dementia) is a process that takes years or maybe even decades. The period when cognitive deficits may be detected, although the person does not fulfill the diagnostic criteria for dementia, is often called the preclinical phase. Since treatment is more effective the earlier it is started, early diagnosis is beneficial.

The Kungsholmen Project was a longitudinal population-based study targeting persons living in the Kungsholmen parish who were 75 years old or older on October 1, 1987. These persons were subjected to a number of tests approximately every 3 years until 2000, when most of them had passed away. The tests included a number of cognitive tests, testing global cognitive ability, primary memory, episodic memory, visuospatial ability and verbal ability. At each test occasion, each person was classified as either having a type of dementia or being healthy (controls). Between follow-ups some of the controls developed dementia, and their results from the previous test occasion were compared with those of the persons who remained healthy. Substantial differences in performance between these two groups were found for many of the cognitive tests, but the overlap between the two groups is large. My contribution to the project will be to combine the results of different tests to develop a classifier that can predict who runs the risk of developing dementia within 3 years. Since treatment is different, it would also be desirable to distinguish between Alzheimer's disease (AD) and Vascular dementia (VaD), although this may prove difficult. Making a differential diagnosis between VaD and AD is not always a straightforward task. Furthermore, the cause of the dementia is very often a mixture of the two, especially in the oldest age groups.

Wed Nov 7 13:15 Maria Werner, KTH Computational Biology
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
A computational study of the lambda-lac mutants

We present a comprehensive computational study of some 900 possible lambda-lac mutants of the lysogeny maintenance switch in phage lambda, of which only 19 have to date been studied experimentally (Atsumi & Little, PNAS 103: 4558-4563, (2006)). We clarify that these mutants realise regulatory schemes quite different from wild-type lambda, and can therefore be expected to behave differently, within the conventional mechanistic setting in which this problem has traditionally been framed. We verify that indeed, with reasonable modelling assumptions and across this wide selection of mutants, the lambda-lac mutants for the most part either have no stable lytic states, or should only be inducible with difficulty. In particular, the computational results contradict the experimental finding that four lambda-lac mutants both show stable lysogeny and are inducible. This work hence suggests either that the four out of 900 mutants are special, or that lambda lysogeny and inducibility are holistic effects involving other molecular players or other mechanisms, or both. The approach illustrates the power and versatility of computational systems biology to systematically and quickly test a wide variety of examples and alternative hypotheses for future closer experimental study.

Wed Nov 14 13:15 Amilcar Flores, CMM, Karolinska Institutet
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
Molecular studies of SOCS action

It is of principal importance to understand how different signalling pathways are changed in relation to different disorders. In this seminar, the JAK-STAT-SOCS signalling pathway will be discussed. This pathway is known to be of relevance for immune functions, metabolic control and cancer. The components of this pathway will be briefly reviewed, as will the principles by which the pathway is activated and inactivated. Particular emphasis will be placed on STAT5 and SOCS2, factors that seem to be important in growth regulation and in prostate cancer. The potential of using high-throughput technologies and predictive methods to probe the JAK-STAT pathway will be exemplified.

Wed Nov 21 13:15 Aymeric Fouquier d'Herouel, KTH Computational Biology
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
Function and Identification of ncRNA in Bacteria

Non-coding (nc) RNAs are emerging as important and surprisingly ubiquitous players in gene regulation across all kingdoms, adding new layers of complexity to the regulatory machinery. Besides challenging views about the junk content of higher organisms' DNA, they open possible explanations for the riddle of high complexity vs. low gene number observed in many organisms.

I present the sequel to a genome-wide ncRNA search in the genome of enterococci, which is potentially applicable to other prokaryotes and to some extent to eukaryotic and viral genomes.

Wed Nov 28 13:00 David Messina, SBC
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
DAS, the distributed annotation system, and how to aggregate shared data with it

DAS is a simple protocol designed for easy sharing of biological data. In this talk I will introduce you to DAS, show some examples of data sources that offer their information via DAS, and describe my work on DASher, a viewer application that collects DAS-format annotations and displays them along a protein sequence.
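
As a rough illustration of what collecting DAS annotations along a sequence might involve, the sketch below parses a small features document and draws each feature as a text track. It assumes the DAS 1.x features response layout (SEGMENT and FEATURE elements with START and END children); the sample document, the accession P00001 and the example.org URL are invented for this sketch, and the code is not taken from DASher.

```python
import xml.etree.ElementTree as ET

# Made-up DAS-style features response; ids, labels and coordinates are
# hypothetical and only serve to exercise the parser below.
SAMPLE_DASGFF = """<DASGFF>
  <GFF version="1.0" href="http://example.org/das/demo/features?segment=P00001">
    <SEGMENT id="P00001" start="1" stop="60">
      <FEATURE id="f1" label="signal peptide">
        <TYPE id="signal">signal</TYPE>
        <START>1</START>
        <END>12</END>
      </FEATURE>
      <FEATURE id="f2" label="kinase domain">
        <TYPE id="domain">domain</TYPE>
        <START>20</START>
        <END>55</END>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""


def parse_features(xml_text):
    """Return (segment id, label, type, start, end) for every feature."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        for feat in segment.findall("FEATURE"):
            features.append((
                segment.get("id"),
                feat.get("label"),
                feat.findtext("TYPE"),
                int(feat.findtext("START")),
                int(feat.findtext("END")),
            ))
    return features


def render_track(length, start, end):
    """Draw one feature as '#' characters on a '-' line (1-based, inclusive)."""
    return "-" * (start - 1) + "#" * (end - start + 1) + "-" * (length - end)


if __name__ == "__main__":
    for seg_id, label, ftype, start, end in parse_features(SAMPLE_DASGFF):
        print(f"{label:>15} {render_track(60, start, end)}")
```

A real client would fetch such documents from several DAS servers and merge the resulting feature lists before drawing them; that networking step is left out here so the sketch runs offline.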

Wed Dec 5 16:00 Gabriel Östlund, SBC
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
Agreement between proteome and transcriptome: a study of correlation between two large-scale datasets

With the human genome sequenced, much effort is being put into elucidating the interactions and functions of the encoded proteins. This has been done focusing on both the proteome and the transcriptome, with analysis of e.g. co-expression, co-localization and tissue profiling. How mRNA expression correlates with relative protein amounts could be of utmost importance, e.g. when using protein co-expression as an indicator of protein functional coupling. Pairwise correlations at the gene level were generated for two datasets: one of antibody-based tissue profiling of proteins, from the Human Protein Atlas, and one of tissue-specific patterns of mRNA expression, from the GNF transcriptome atlas. This was done using Pearson and Spearman correlations as well as mutual information. Finally, the correlation between the two datasets was calculated, revealing a moderate correlation which was nevertheless significantly higher than in random simulations.
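
To make the comparison above concrete, here is a minimal sketch of Pearson and Spearman correlation together with a permutation test against shuffled data. The two six-tissue profiles are invented numbers, the permutation test is one assumed way of comparing against "random simulations", and mutual information is omitted for brevity; none of this is the speaker's actual pipeline.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equally long sequences with non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_p_value(x, y, stat=pearson, n_perm=1000, seed=0):
    """Share of shuffles whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = stat(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if stat(x, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical per-tissue levels for one gene (protein vs. mRNA).
    protein = [1, 2, 3, 4, 5, 6]
    mrna = [2, 1, 4, 3, 6, 5]
    print(pearson(protein, mrna), spearman(protein, mrna))
    print(permutation_p_value(protein, mrna))
```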

Wed Dec 12 16:00 Örjan Åkerborg, SBC
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
A computational screen implicating A-to-I editing as a key mechanism in fine-tuning proteome diversity

Several bioinformatic approaches have previously been used to find novel sites of ADAR-mediated A-to-I editing in humans. These studies have resulted in the discovery of thousands of genes hyper-edited in their non-coding regions, but very few substrates that are site-selectively edited. We have compiled a screen to search for new sites of selective editing, primarily in coding sequences. To avoid hyper-edited repeat regions, we have applied our screen to the Alu-free mouse genome. First we construct an explorative screen based on RNA structure and genomic sequence conservation. We evaluate the explorative screen by means of enrichment of A-G mismatch, that is, the discrepancy between the expressed sequence and the genomic template for A-to-I edited sites. The enrichment and the corresponding p-values implicate A-to-I editing as a key mechanism in fine-tuning proteome diversity. Known substrates suggest that A-to-I editing is particularly important for normal brain development in mammals. Subsequently, we extend the explorative screen by including A-G mismatch as well as a specific scoring scheme based on characteristics of known A-to-I edited sites. Applying our extended screen to the mouse genome gives a substantial number of novel putative substrates, of which 63 are currently experimentally validated.
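
The A-G mismatch signal used above can be illustrated with a few lines of code: inosine is read as guanosine during sequencing, so an edited site shows up as an A in the genomic template and a G in the expressed sequence. The sketch below only tallies mismatch types in one pre-aligned, gap-free pair of sequences; the sequences are invented, and the enrichment statistics and p-values of the actual screen are not reproduced.

```python
from collections import Counter


def mismatch_counts(genomic, expressed):
    """Count (genomic base, expressed base) mismatches in an ungapped alignment."""
    if len(genomic) != len(expressed):
        raise ValueError("sequences must be aligned to the same length")
    counts = Counter()
    for g, e in zip(genomic.upper(), expressed.upper()):
        # Skip matches and ambiguous characters such as N.
        if g != e and g in "ACGT" and e in "ACGT":
            counts[(g, e)] += 1
    return counts


def ag_fraction(counts):
    """Fraction of all mismatches that are genomic A read as expressed G."""
    total = sum(counts.values())
    return counts[("A", "G")] / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical genomic template and matching expressed sequence.
    genomic = "ACGTAACGTA"
    expressed = "GCGTAGCGTC"
    counts = mismatch_counts(genomic, expressed)
    print(dict(counts), ag_fraction(counts))
```

Comparing this fraction between candidate regions and a background set is one simple way to think about enrichment, though the test used in the talk may differ.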

Wed Dec 19 16:00 Anna Henricson, SBC
Seminar room RB35 (Roslagstullsbacken 35, the SBC house)
Domain tree based analysis of protein architecture evolution

Understanding the dynamics behind domain architecture evolution is of great importance to unravel the functions of proteins. Complex architectures have been created throughout evolution by rearrangement and duplication events. An interesting question is how many times a particular architecture has been created, a form of convergent evolution or domain architecture reinvention. Previous studies have approached this issue by comparing architectures found in different species. We wanted to achieve a finer-grained analysis by reconstructing protein architectures on complete domain trees.

The prevalence of domain architecture reinvention in 96 genomes was investigated with a novel domain tree based method that uses maximum parsimony for inferring ancestral protein architectures. Domain architectures were taken from Pfam. To ensure robustness, we applied the method to bootstrap trees, and only considered results with strong statistical support.

We detected multiple origins for 12.4% of the scored architectures. In a much smaller dataset, the subset of completely domain-assigned proteins, the figure was 5.6%. These results indicate that domain architecture reinvention is a much more common phenomenon than previously thought. We also determined which domains are most frequent in multiply created architectures, and assessed whether specific functions could be attributed to them. However, no strong functional bias was found in architectures with multiple origins.
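
The idea of counting independent origins of an architecture on a tree can be sketched with Fitch's small-parsimony algorithm, used here as a stand-in for the method in the talk. The simplifications are assumptions of this sketch: a single architecture is treated as a binary trait (1 = present, 0 = absent) on the leaves of a strictly binary tree, ambiguous ancestral states are resolved towards "absent", and the tree and leaf states are invented. Bootstrap support and Pfam architectures are not modelled.

```python
def fitch_sets(tree, leaf_state):
    """Bottom-up pass: return (candidate state set, annotated children).

    A tree is either a leaf name (str) or a pair (left, right).
    """
    if isinstance(tree, str):
        return ({leaf_state[tree]}, [])
    left, right = tree
    children = [fitch_sets(left, leaf_state), fitch_sets(right, leaf_state)]
    common = children[0][0] & children[1][0]
    states = common if common else children[0][0] | children[1][0]
    return (states, children)


def count_origins(node, parent_state=None):
    """Top-down pass: count absent-to-present transitions (gains).

    Presence at the root counts as one origin. Ambiguous states follow the
    parent when possible, otherwise the smaller state (absent) is chosen.
    """
    states, children = node
    if parent_state is not None and parent_state in states:
        state = parent_state
    else:
        state = min(states)
    gained = 1 if state == 1 and parent_state in (None, 0) else 0
    return gained + sum(count_origins(child, state) for child in children)


if __name__ == "__main__":
    # Hypothetical domain tree with six sequences.
    tree = ((("a", "b"), "c"), (("d", "e"), "f"))
    presence = {"a": 1, "b": 1, "c": 0, "d": 0, "e": 1, "f": 0}
    print(count_origins(fitch_sets(tree, presence)))
```

In this toy tree the architecture is present in the sister pair a, b and separately in e, so the reconstruction reports two origins, which is the kind of event the abstract calls reinvention.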

Anna Henricson
Last modified: Dec 19 2007