SBC seminars 2013

Tuesday September 10, 11:00
Walter Basile, Stockholm University
Lunch room at Scilifelab, alpha floor 2
ORF conservation and de-novo creation in the yeast Saccharomyces cerevisiae
The yeast Saccharomyces cerevisiae was one of the first organisms to be fully sequenced. At the time of that study, it was noted that for a significant percentage of the ~6000 described ORFs it was not possible to find a homolog in any other species. It was at first believed that the number of these "orphan" genes would decrease as more sequenced genomes became available, but that is not the case. It is now accepted that more than 12% of all S. cerevisiae genes are orphans, with ~20% of them being species-specific. We conducted a large-scale study of the yeast genome, analyzing each ORF through a computational pipeline aimed at finding the syntenic region in 17 other fully sequenced fungal species and 12 S. cerevisiae strains. We used a clustering approach to divide the gene set into 8 groups with different levels of conservation, and for each group we described the most prominent intrinsic properties as well as the experimental evidence. Finally, we propose a phylogenetic tree that summarizes gene gains and losses across the entire fungal clade.
Tuesday September 3, 10:30
Michael Y. Galperin, NCBI, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Lunch room at Scilifelab, floor 2
Genomic and biogeochemical clues to the origin of life
In the past, origin of life on Earth has been treated mostly as a philosophical problem with little connection to everyday biological research. Even after the possibility of abiotic origin of amino acids and nucleic acid bases had been demonstrated in 1953, there has been no agreement on the energy source(s) for the formation of increasingly complex biopolymers (redox or thermal gradients, UV, atmospheric electricity, etc.), the driving force(s) leading to the emergence of the first life forms (natural selection vs spontaneous self-organization), their properties (RNA-based vs metabolism-based, autotrophic vs heterotrophic, etc.), or place of origin (deep sea vs fresh water). The availability of genomic data for diverse bacteria, archaea, and eukaryotes, including various extremophiles, allowed us to take a new look at this problem. By identifying the common genome core of all (known) living organisms, and the shared properties of their cells, it has become possible to deduce simple and reasonable biogeochemical constraints on the conditions that led to the origin of life and to get an insight on where it has happened and how. In turn, these reconstructions lead to new questions that can now be addressed experimentally, bringing the whole enterprise into the realm of "normal" science. The most surprising result of these studies is the growing impression that the origin of life has been a natural consequence of the geochemical conditions that existed on the primordial Earth, rather than a one-time improbable accident.

Mulkidjanian AY and Galperin MY (2009) Biol Direct 4:27. PMID: 19703275
Mulkidjanian AY et al. (2012) Proc Natl Acad Sci USA 109:E821. PMID: 22331915
Tuesday June 11, 10:30
Teppo Niinimäki, University of Helsinki, Helsinki, Finland
Lunch room at Scilifelab, floor 2
Treedy: A Heuristic for Counting and Sampling Subsets
Consider a collection of weighted subsets of a ground set N. Given a query subset Q of N, how fast can one (1) find the weighted sum over all subsets of Q, and (2) sample a subset of Q proportionally to the weights? We present a tree-based greedy heuristic, Treedy, that for a given positive tolerance d answers such counting and sampling queries to within a guaranteed relative error d and total variation distance d, respectively. Experimental results on artificial instances and in application to Bayesian structure discovery in Bayesian networks show that approximations yield dramatic savings in running time compared to exact computation, and that Treedy typically outperforms a previously proposed sorting-based heuristic.
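For reference, the two queries in this abstract can be answered exactly by a linear scan over the collection. The sketch below shows that brute-force baseline only. It is not Treedy itself and carries no approximation guarantee. The function names `count_subsets` and `sample_subset` and the dictionary representation of the collection are assumptions made for illustration.

```python
import random
from typing import Dict, FrozenSet, Iterable

# Hypothetical representation: each stored subset of the ground set N
# maps to its (positive) weight.
Weights = Dict[FrozenSet[str], float]


def count_subsets(weights: Weights, query: Iterable[str]) -> float:
    """Query (1): exact weighted sum over all stored subsets contained in Q."""
    q = frozenset(query)
    return sum(w for s, w in weights.items() if s <= q)


def sample_subset(weights: Weights, query: Iterable[str],
                  rng: random.Random) -> FrozenSet[str]:
    """Query (2): draw one stored subset of Q with probability proportional to its weight."""
    q = frozenset(query)
    candidates = [(s, w) for s, w in weights.items() if s <= q]
    if not candidates:
        raise ValueError("no stored subset is contained in the query set")
    subsets, ws = zip(*candidates)
    return rng.choices(subsets, weights=ws, k=1)[0]


if __name__ == "__main__":
    example: Weights = {
        frozenset(): 1.0,
        frozenset({"a"}): 2.0,
        frozenset({"b"}): 3.0,
        frozenset({"a", "b"}): 4.0,
        frozenset({"c"}): 5.0,
    }
    print(count_subsets(example, {"a", "b"}))
    print(sample_subset(example, {"a", "b"}, random.Random(0)))
```

Each query costs time linear in the size of the collection, which is the running time that an approximate method with tolerance d is meant to undercut.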
Monday June 10, 14:00
Gabriele Orlando, Master thesis, Stockholm University
Lunch room at Scilifelab, floor 2
Large-scale prediction of GPCR structures
G protein-coupled receptors (GPCRs) are involved in many biological processes and are one of the most important families of drug targets. However, only a few of their structures are known and the way drugs bind to their orthosteric sites is still largely unknown. Accurate modeling of GPCR binding sites could allow the design of new drugs using in silico screening. The goal of this project is to build a program that automatically infers the structures of GPCRs by homology and evaluates the ability of the resulting models to discriminate active ligands from a pool of random drug-like molecules. We tested the program by modeling the structures of the dopamine and the histamine receptors and evaluated each model by docking a pool of known active ligands taken from the ChEMBL database mixed with random drug-like molecules. We also explored novel methods that could increase the accuracy of the models based on sampling of the conformations of the side chains in the orthosteric site. We found that some of our methods can significantly increase the quality of the models.
Tuesday June 4, 10:30
Satish Nair, Stockholm University
Lunch room at Scilifelab, floor 2
Investigating polyG repeat variation between individual single cells
Advances in massively parallel sequencing technology have helped us obtain more information about DNA. Together with the development of new bioinformatics software, they now allow reliable and efficient variant calling from massively parallel sequencing data. Studies have shown that polyG regions in DNA are more prone to replication errors than other homopolymer repeats and, importantly, more variable than non-homopolymer regions. We think that mutations in polyG regions could contain useful information about how single cells are related to each other. In this project we use the NGS data from the BGI paper, containing 20 tumor single-cell samples, 5 normal single-cell samples and 1 normal tissue sample. We designed a pipeline suited to variant calling in homopolymer repeats, using Bowtie2 for mapping and GATK for variant calling. A phylogenetic analysis was made based on SNPs as well as indels. Since the classical phylogeny approach cannot be used for polyG indels, an in-house Python script was written to calculate the indicative distance, a relative distance measure between two single cells. A bootstrap analysis was done to assess the reliability of the results. The analysis clearly shows that the cancer cells and the normal cells cluster separately, with good bootstrap values for both SNPs and indels. These results indicate that variation in polyG repeats could be used to find how the single-cell samples are related to each other. This analysis was done on exome sequences, which normally have lower mutation rates. If we are able to cluster the cancer and normal samples with the variants available from exome sequencing, then we could get a better and more specific phylogenetic analysis from whole-genome sequencing data.
Tuesday June 4, ~11:00
Daniele Raimondi, Stockholm University
Lunch room at Scilifelab, floor 2
Master thesis presentation: Deep learning ensemble methodology for direct information contact prediction
Recently, several new contact prediction methods have been published. They (i) use large sets of multiple aligned sequences and (ii) assume that correlations between columns in these alignments can result from residue interactions and are thus clues to the residues' spatial proximity in the native structure. These methods are clearly superior to earlier methods when it comes to predicting contacts in proteins. PconsC [2], developed by Marcin J. Skwark, combines predictions from two direct information methods, PSICOV [4] and plmDCA [3], and two alignment methods, HHblits and jackhmmer, at four different e-value thresholds, and improves predictive performance over the individual methods on which it is based. The aim of this thesis project was to further improve the quality of these predictions.

To achieve this goal, I developed a Deep Learning architecture capable of performing structured predictions, taking into consideration the significant amount of information underlying the contact prediction problem instead of simply considering each residue pair independently of the others (it has been shown in [1] that contacts in the native structure rarely involve a single pair of residues). I implemented a multilayer learner using Random Forest classifiers that improves contact predictions by abstracting some typical inter/multi-residue relationships among neighbouring residue pairs, namely by learning to recognize frequent visual patterns (mainly Secondary Structure features, such as alpha-helices and beta-sheets) in the contact maps. This abstraction ability can relocate the most uncertain predictions into the recognized patterns, reconstructing them and thus significantly improving the precision of the overall contact map.

This Deep Learning approach, along with some additional features (e.g. predicted Secondary Structure and predicted Relative Solvent Accessibility), can provide a further 20% improvement in PconsC's predictive performance.

[1] Pietro Di Lena, Ken Nagata and Pierre Baldi, Deep architectures for protein contact map prediction, Bioinformatics (2012), 28(19), 2449-2457. doi: 10.1093/bioinformatics/bts475
[2] Marcin J. Skwark, Abbi Abdel-Rehim and Arne Elofsson, PconsC: Combination of direct information methods and alignments improves contact prediction, Bioinformatics (2013). doi: 10.1093/bioinformatics/btt259
[3] Ekeberg, M., Lovkvist, C., Lan, Y., Weigt, M., and Aurell, E. (2013), Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models, Phys Rev E Stat Nonlin Soft Matter Phys, 87(1-1), 012707.
[4] Jones, D., Buchan, D., Cozzetto, D., and Pontil, M. (2012), PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments, Bioinformatics, 28(2), 184-190.
Thursday May 23, 14:00
Rune Linding, Technical University of Denmark, Lyngby, Denmark
Lunch room at Scilifelab, floor 2
Biological Forecasting and Cancer Kinome Networks
Biological systems are composed of highly dynamic and interconnected molecular networks that drive biological decision processes. The goal of network biology is to describe, quantify and predict the information flow and functional behaviour of living systems in a formal language and with an accuracy that parallels our characterisation of other physical systems such as jumbo jets. Decades of targeted molecular and biological studies have led to numerous pathway models of developmental and disease-related processes. However, so far no global models capable of predicting cellular trajectories in time, space or disease have been derived from pathways. The development of high-throughput methodologies has further enhanced our ability to obtain quantitative genomic, proteomic and phenotypic readouts for many genes/proteins simultaneously. Here, I will discuss how it is now possible to derive network models through computational integration of systematic, large-scale, high-dimensional quantitative data sets. I will review our latest advances in methods for exploring phosphorylation networks. In particular I will discuss how the combination of quantitative mass spectrometry, systems genetics and computational algorithms (NetworKIN [1] and NetPhorest [4]) made it possible for us to derive systems-level models of JNK and EphR signalling networks [2,3]. I shall discuss work we have done in comparative phospho-proteomics and network evolution [5-7]. Finally, I will discuss our most recent work in analysing genomic sequencing data from NGS studies and how we have developed powerful new algorithms to predict the impact of disease mutations on cellular signalling networks [8,9].


[1] Linding et al., Cell 2007.
[2] Bakal et al., Science 2008.
[3] Jorgensen et al., Science 2009.
[4] Miller et al., Science Signaling 2008.
[5] Tan et al., Science Signaling 2009.
[6] Tan et al., Science 2009.
[7] Tan et al., Science 2011.
[8] Creixell et al., Nature Biotechnology Sep 2012.
[9] Erler & Linding, Cell May 2012.
Tuesday May 14, 10:30
Muhammad Owais Mahmudi, KTH, Royal Institute of Technology
Lunch room at Scilifelab, floor 2
Probabilistic genome-wide reconciliation analysis across metazoans
Gene duplication is considered to be a driving force of evolution that enables the genome of a species to acquire new functions. A reconciliation, a mapping of gene tree vertices to the edges or vertices of the species tree, explains where exactly on the species tree gene duplications occurred. The most parsimonious reconciliation (MPR) is the reconciliation that minimizes the number of duplications. We present methods to sample reconciliations and compute the most likely reconciliations of gene and species trees. The reconciliations are sampled from a posterior over reconciliations, gene trees, edge lengths and other parameters, given the species tree and gene sequences. To obtain this posterior we employ the Bayesian analysis tool DLRS, based on a probabilistic model that integrates gene duplication, gene loss and sequence evolution under a relaxed molecular clock for substitution rates.

We perform a genome-wide analysis of a nine-species dataset and conclude that for gene families with a higher rate of duplications, the most parsimonious reconciliation is not the correct explanation of the evolutionary history. For the given dataset, we observed that approximately 19% of the sampled reconciliations were not identical to the MPR, in contrast with previous estimates, where 98% of the reconciliations were observed to be identical to the MPR (Rasmussen et al. 2011). A heatmap is also generated for the sampled finer reconciliations, which map gene duplications to exact time points on the edges of the species tree and help us understand the evolutionary history of genes and species.

Rasmussen M.D., Kellis M. A Bayesian approach for fast and accurate gene tree reconstruction. Mol. Biol. Evol. 2011;28:273-290.
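
As an illustration of the most parsimonious reconciliation mentioned in this abstract, the sketch below uses the standard LCA mapping: each gene tree vertex is mapped to the lowest species tree vertex that covers all species below it, and a vertex counts as a duplication when it maps to the same species vertex as one of its children. This is not DLRS and does no sampling. The nested-tuple tree encoding and the function names are assumptions made for illustration, and gene tree leaves are labelled directly with species names.

```python
from typing import FrozenSet, List, Tuple, Union

# Hypothetical encoding: a leaf is a species name, an inner vertex is a tuple of subtrees.
Tree = Union[str, tuple]
Clade = FrozenSet[str]


def clades(species_tree: Tree) -> List[Clade]:
    """Return the leaf set of every vertex of the species tree (own clade last)."""
    if isinstance(species_tree, str):
        return [frozenset([species_tree])]
    result: List[Clade] = []
    leaves: Clade = frozenset()
    for child in species_tree:
        sub = clades(child)
        result.extend(sub)
        leaves |= sub[-1]  # the last entry is the child's own clade
    result.append(leaves)
    return result


def lca(species_clades: List[Clade], leaves: Clade) -> Clade:
    """Smallest species tree clade containing all given species (their LCA)."""
    return min((c for c in species_clades if leaves <= c), key=len)


def count_duplications(gene_tree: Tree, species_clades: List[Clade]) -> Tuple[Clade, int]:
    """Return (LCA mapping of the root, number of duplications in the subtree)."""
    if isinstance(gene_tree, str):
        return lca(species_clades, frozenset([gene_tree])), 0
    children = [count_duplications(c, species_clades) for c in gene_tree]
    covered = frozenset().union(*(m for m, _ in children))
    mapped = lca(species_clades, covered)
    dups = sum(d for _, d in children)
    if any(m == mapped for m, _ in children):
        dups += 1  # vertex shares its mapping with a child: duplication
    return mapped, dups


if __name__ == "__main__":
    species = clades((("A", "B"), "C"))
    print(count_duplications((("A", "B"), ("A", "C")), species)[1])
    print(count_duplications((("A", "B"), "C"), species)[1])
```

A gene tree that matches the species tree needs no duplications under this mapping, while a gene tree that pairs A with C next to an (A, B) subtree needs one at the root.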
Thursday April 18, 15:00
Kristoffer Forslund, EMBL, Heidelberg
Lunch room at Scilifelab, floor 2
Country-specific antibiotic use practices impact the human gut resistome
Despite increasing concerns over inappropriate use of antibiotics in medicine and food production, population-level resistance transfer into the human gut microbiota has not been demonstrated beyond individual case studies. To determine the “antibiotic resistance potential” for entire microbial communities, we employ metagenomic data and quantify the totality of known resistance genes in each community (its resistome) for 68 classes and subclasses of antibiotics. In 252 fecal metagenomes from three countries, we show that the most abundant resistance determinants are those for antibiotics also used in animals, and for antibiotics that have been available longer. Resistance genes are also more abundant in samples from Spain, Italy and France than from Denmark, the US, or Japan. Where comparable country-level data on antibiotic use in both humans and animals are available, differences in these statistics match the observed resistance potential differences. The results are robust over time as the antibiotic resistance determinants of individuals persist in the human gut flora for at least a year.
Tuesday April 16, 10:30
Amin Saffari, KTH, Royal Institute of Technology
Lunch room at Scilifelab, floor 2
Unique peptide-level statistics through spectral clustering
During the last few years, tandem mass spectrometry has played a major role in analyzing and identifying protein mixtures. After digesting the proteins into a mixture of peptides and randomly fragmenting these peptides along their backbone, the masses of the resulting peptide fragments are measured and matched against theoretically predicted spectra. The resulting peptide-spectrum matches (PSMs) can be used to infer the peptides in the protein mixture, and hence also the proteins. An interesting feature of current mass spectrometers is that we run into several examples where multiple fragment spectra arise from a single peptide species. This means that the error rates for unique peptides differ from the error rates of the PSMs.

As the PSMs deriving from the same peptide are not probabilistically independent, it is hard to compensate for the redundant PSMs after we have derived the error rates of the individual PSMs. Instead we propose a scheme where we reformat the observed data, e.g. the spectra, before the final processing.

We use so-called spectral clustering to combine the fragment spectra, with the aim of obtaining spectra with clearer fragmentation patterns and hence higher matching scores between the peptide and the constructed spectrum.
Tuesday April 9, 11:00
Simon Merid, Stockholm University
Lunch room at Scilifelab, floor 2
Gene network analysis to detect driver mutations in cancer
In a cancer tumor, it is hard to tell which of the somatic mutations have driven the emergence and progression of the cancer. Earlier approaches collected the most frequently observed mutations or considered consequences at the polypeptide chain level. Our framework detects driver mutations in individual tumors via functional network analysis. First, we benchmarked different versions of the global gene network by their ability to identify genes of known pathways and selected the best options. Then, actual sets of somatic mutations found in glioblastoma multiforme and ovarian serous cystadenocarcinoma were analyzed with the same test. Using this procedure, novel likely drivers were detected in a number of individuals. We compare the network analysis to earlier approaches.
Tuesday April 9, 10:30
Ino de Bruijn, Stockholm University
Lunch room at Scilifelab, floor 2
Benchmark of de novo Short Read Assembly Strategies for Metagenomics
Metagenomics, the sequencing of environmental DNA, has proven to be a promising approach for the discovery and investigation of microbes that cannot be cultured in the laboratory, as well as for the study of both free-living microbial communities and microbial communities inside other organisms. In a typical shotgun metagenomics experiment the DNA of a community is isolated and high-throughput sequencing is performed on a random sample of the isolated DNA. The reads can either be analyzed as such, e.g. by BLAST searches against reference databases to obtain a functional profile of the microbial community, or they can be assembled to form longer stretches of DNA stemming from the same or closely related organisms, which can subsequently be analyzed with regard to phylogenetic affiliation and functional properties.

In our studies several strategies for de novo assembly of metagenomes have been evaluated. Illumina short read libraries have been shown in silico to work well on communities of medium complexity; we have therefore chosen to assess the assembly strategies for Illumina paired short reads specifically. Previous studies have mostly used in silico metagenomic data sets. In contrast, the community in our study is an in vitro simulated metagenome consisting of 59 species with completed or nearly completed genomes, so the quality of our assessment does not depend on the realism of read simulators. An even and an uneven distribution of the 59 species were created in vitro. The community has been sequenced with different types of library preparation so that the effect of library preparation can be tested as well. The following assembly programs have been tested: Velvet, MetaVelvet, Newbler, Minimus2, Bambus2 and Ray. The quality of the assemblies has been evaluated by mapping the constructed contigs or scaffolds to the collection of reference genomes. In addition, two pipelines have been constructed, one to perform the assemblies and another to perform the validation when a reference is available.
Tuesday March 19, 10:30
Ramya Seetharaman, Stockholm University
Lunch room at Scilifelab, floor 2
Analysis of trans-membrane beta-barrel proteins: Insights into hydrophobicity, conservation and residue distribution
Transmembrane beta-barrel proteins (TMBs) are found in the outer membranes of Gram-negative bacteria, chloroplasts and mitochondria. They are known to perform a variety of essential functions, including transport, pore formation, voltage gating and drug efflux, to name a few. TMBs are also of immense biomedical importance as promising targets for vaccines and antimicrobial drugs. Despite this, being membrane proteins, TMBs are not readily amenable to structure determination procedures such as crystallography and NMR. Hence not many non-homologous structures of this class of proteins are available. This necessitates predicting their structure and topology using computational methods.

Many of the computational structure prediction methods incorporate the specific physico-chemical properties of the TMB residues, which in turn are gathered from already resolved structures. Among other properties, the hydrophobicities of residues found in different regions of the membrane, the extent to which residues in various spatial locations are conserved, and the abundance of different residues in different regions of the biomembrane can be used effectively in computational methods to improve the efficiency and accuracy of structure and topology prediction algorithms. Of special interest are the residues found in trimeric interfaces, as these residues may play a role in the oligomerization of monomers leading to trimer formation.

In this study an analysis of the hydrophobicity, conservation and residue distribution in a dataset of both monomeric and trimeric TMBs is undertaken to clarify how these properties change across different regions of the protein, as well as between different classes of residues. The significant findings offer a possibility to better understand and predict the structure and topology of TMBs, and can be used to complement conventional structural studies of TMBs.
Tuesday March 12, 10:30
Riccardo Vicedomini, University of Udine, Italy
Lunch room at Scilifelab, floor 2
GAM-NGS: Genomic Assemblies Merger for NGS
In recent years more than 20 assemblers have been proposed to tackle the hard task of assembling NGS data. A common heuristic when assembling a genome is to use several assemblers and then select the best assembly according to some criteria. However, recent results clearly show that some assemblers lead to better statistics than others on specific regions but are outperformed on other regions or on different evaluation measures.

To limit these problems we developed GAM-NGS (Genomic Assemblies Merger for Next Generation Sequencing), whose primary goal is to merge two or more assemblies in order to enhance contiguity and correctness of both. GAM-NGS does not rely on global alignment: regions of the two assemblies representing the same genomic locus (called blocks) are identified through read alignments and stored in a weighted graph. The merging phase is carried out with the help of this weighted graph, which allows an optimal resolution of local problematic regions.

GAM-NGS has been tested on six different datasets and compared to other assembly reconciliation tools. The availability of a reference sequence for three of them allowed us to show that GAM-NGS can output an improved and reliable set of sequences. GAM-NGS is also very efficient, merging assemblies with substantially fewer computational resources than comparable tools. To achieve this, GAM-NGS avoids global alignment between contigs, which makes its strategy unique among assembly reconciliation tools.

Tuesday January 15, 10:30
Viktoria Dorfer, University of Applied Sciences Upper Austria (FH OÖ), Hagenberg, Austria
Lunch room at Scilifelab, floor 2
Using MS Amanda for Identifying High Resolution and High Accuracy Tandem Mass Spectra
MS Amanda is a new high-speed system for identifying and scoring peptides from tandem mass spectrometry data using a database of known proteins. The algorithm is designed especially for high resolution and high accuracy tandem mass spectra. This work is a collaboration between the Protein Chemistry Lab at the Research Institute of Molecular Pathology (IMP), Vienna, and the Bioinformatics Research Group at the University of Applied Sciences Upper Austria (FH OÖ), Hagenberg. In this presentation we introduce MS Amanda and show results achieved on various datasets in comparison to Mascot and SEQUEST.

Previous seminars at SBC: 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, and 2012.

Other seminar series

SBC Journal Club on Protein Function
SBC Journal Club on Sequence/Structure

Biochemistry Department Seminar Series
Karolinska Seminar Series

Viktor Granholm