Tuesday September 10 | 11:00 | Walter Basile | Stockholm University |
Lunch room at Scilifelab, Alpha floor 2 | ORF conservation and de novo creation in the yeast Saccharomyces cerevisiae | ||
The yeast Saccharomyces cerevisiae was one of the first organisms to be fully sequenced. At the time of that study, it was noted that for a significant percentage of the ~6000 described ORFs no homolog could be found in any other species. It was initially believed that the number of these "orphan" genes would decrease as more sequenced genomes became available, but that has not been the case. It is now accepted that more than 12% of all S. cerevisiae genes are orphans, with ~20% of these being species-specific. We conducted a large-scale study of the yeast genome, analyzing each ORF through a computational pipeline aimed at finding the syntenic region in 17 other fully sequenced fungal species and 12 S. cerevisiae strains. We used a clustering approach to divide the gene set into 8 groups with different levels of conservation, and for each group we described the most prominent intrinsic properties as well as experimental evidence. Finally, we propose a phylogenetic tree that summarizes gene gains and losses across the entire fungal clade. | |||
Tuesday September 3 | 10:30 | Michael Y. Galperin | NCBI, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA |
Lunch room at Scilifelab, floor 2 | Genomic and biogeochemical clues to the origin of life | ||
In the past, the origin of life on Earth was treated mostly as a philosophical problem with little
connection to everyday biological research. Even after the possibility of abiotic origin of amino
acids and nucleic acid bases was demonstrated in 1953, there has been no agreement on
the energy source(s) for the formation of increasingly complex biopolymers (redox or thermal
gradients, UV, atmospheric electricity, etc.), the driving force(s) leading to the emergence of the
first life forms (natural selection vs spontaneous self-organization), their properties (RNA-based vs
metabolism-based, autotrophic vs heterotrophic, etc.), or place of origin (deep sea vs fresh water).
The availability of genomic data for diverse bacteria, archaea, and eukaryotes, including various
extremophiles, allowed us to take a new look at this problem. By identifying the common genome
core of all (known) living organisms, and the shared properties of their cells, it has become possible
to deduce simple and reasonable biogeochemical constraints on the conditions that led to the origin
of life and to gain insight into where and how it happened. In turn, these reconstructions lead
to new questions that can now be addressed experimentally, bringing the whole enterprise into the
realm of "normal" science. The most surprising result of these studies is the growing impression that
the origin of life was a natural consequence of the geochemical conditions that existed on the
primordial Earth, rather than a one-time improbable accident. Mulkidjanian AY and Galperin MY (2009) Biol Direct 4:27. PMID: 19703275. Mulkidjanian AY et al. (2012) Proc Natl Acad Sci USA 109:E821. PMID: 22331915. | |||
Tuesday June 11 | 10:30 | Teppo Niinimäki | University of Helsinki, Helsinki, Finland |
Lunch room at Scilifelab, floor 2 | Treedy: A Heuristic for Counting and Sampling Subsets | ||
Consider a collection of weighted subsets of a ground set N. Given a query subset Q of N, how fast can one (1) find the weighted sum over all subsets of Q, and (2) sample a subset of Q proportionally to the weights? We present a tree-based greedy heuristic, Treedy, that for a given positive tolerance d answers such counting and sampling queries to within a guaranteed relative error d and total variation distance d, respectively. Experimental results on artificial instances and in application to Bayesian structure discovery in Bayesian networks show that approximations yield dramatic savings in running time compared to exact computation, and that Treedy typically outperforms a previously proposed sorting-based heuristic. | |||
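The two queries in the abstract above can be made concrete with a naive exact reference implementation (the subset collection here is hypothetical toy data; Treedy itself organizes the subsets in a tree over the ground set so that it can answer approximately without scanning every stored subset):

```python
import random

# Toy weighted collection of subsets of the ground set {a, b, c}.
weighted = {
    frozenset("ab"): 2.0,
    frozenset("b"): 1.0,
    frozenset("bc"): 3.0,
    frozenset("abc"): 0.5,
}

def count_query(Q):
    """Weighted sum over all stored subsets contained in Q (exact)."""
    Q = frozenset(Q)
    return sum(w for s, w in weighted.items() if s <= Q)

def sample_query(Q):
    """Sample a stored subset of Q proportionally to its weight (exact)."""
    Q = frozenset(Q)
    candidates = [(s, w) for s, w in weighted.items() if s <= Q]
    total = sum(w for _, w in candidates)
    r = random.uniform(0, total)
    for s, w in candidates:
        r -= w
        if r <= 0:
            return s
    return candidates[-1][0]

print(count_query("ab"))  # subsets {a,b} and {b}: 2.0 + 1.0 = 3.0
```

Treedy trades this exactness for speed, guaranteeing relative error at most d for the count and total variation distance at most d for the sample.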
Monday June 10 | 14:00 | Gabriele Orlando | Master thesis, Stockholm University |
Lunch room at Scilifelab, floor 2 | Large-scale prediction of GPCR structures | ||
G protein-coupled receptors (GPCRs) are involved in many biological processes and are one of the most important families of drug targets. However, only a few of their structures are known and the way drugs bind to their orthosteric sites is still largely unknown. Accurate modeling of GPCR binding sites could allow the design of new drugs using in silico screening. The goal of this project is to build a program that automatically infers the structures of GPCRs by homology and evaluates their ability to discriminate active ligands from a pool of random drug-like molecules. We tested the program by modeling the structures of the dopamine and histamine receptors and evaluated each model by docking a pool of known active ligands taken from the ChEMBL database mixed with random drug-like molecules. We also explored novel methods that could increase the accuracy of the models based on sampling of the conformations of the side chains in the orthosteric site. We found that some of our methods can significantly increase the quality of the models. | |||
Tuesday June 4 | 10:30 | Satish Nair | Stockholm University |
Lunch room at Scilifelab, floor 2 | Investigating polyG repeat variation between individual single cells | ||
Advances in massively parallel sequencing have helped us obtain more information about the DNA. Together with the development of new software in the bioinformatics field, we can now perform reliable and efficient variant calling on data obtained from massively parallel sequencing. Studies have shown that polyG regions in the DNA are more prone to replication errors than other homopolymer repeats and, importantly, more variable than non-homopolymer regions. We think the mutations at these polyG regions could contain useful information that could lead us to the relationships between single cells. In this project we use the NGS data from the BGI paper, containing 20 tumor single-cell samples, 5 normal single-cell samples and 1 normal tissue sample. A pipeline better suited for variant calling in homopolymer repeats was designed, using Bowtie2 and GATK for mapping and variant calling, respectively. A phylogenetic analysis was made based on the SNPs as well as the indels. Since it is not possible to use the classical phylogeny approach for polyG indels, an in-house Python script was designed to calculate the indicative distance, a relative distance measure between two single cells. A bootstrap analysis was performed to assess the reliability of the data obtained. The analysis clearly shows that the cancer cells and the normal cells cluster separately, with good bootstrap values for both SNPs and indels. These results indicate that the variation in polyG repeats could be used to determine how single-cell samples are related to each other. This analysis was done on exome sequences, which normally have lower mutation rates. If we are able to cluster the cancer and normal samples with the variants available from exome sequencing, then we could get a better and more specific phylogenetic analysis from whole genome sequencing data. | |||
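The abstract above does not give the exact formula behind the in-house indicative distance, so the following is only one plausible reading of such a relative measure: the fraction of polyG loci, called in both cells, at which the indel calls disagree. The locus names and values are purely illustrative.

```python
def indicative_distance(cell_a, cell_b):
    """Fraction of shared polyG loci where the indel calls differ.

    cell_a / cell_b: dicts mapping locus -> indel length (None = no call).
    This is a hypothetical sketch of a relative pairwise distance, not the
    actual in-house script described in the talk.
    """
    shared = [k for k in cell_a
              if k in cell_b and cell_a[k] is not None and cell_b[k] is not None]
    if not shared:
        return 0.0  # no shared information between the two cells
    differing = sum(1 for k in shared if cell_a[k] != cell_b[k])
    return differing / len(shared)

tumor1 = {"chr1:100": 2, "chr2:50": 0, "chr3:7": 1}
tumor2 = {"chr1:100": 2, "chr2:50": 1, "chr3:7": 1}
print(indicative_distance(tumor1, tumor2))  # one of three shared loci differs
```

A bootstrap over loci, as in the study, would resample the shared loci with replacement and recompute this distance to gauge the stability of the resulting clusters.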
Tuesday June 4 | ~11:00 | Daniele Raimondi | Stockholm University |
Lunch room at Scilifelab, floor 2 | Master thesis presentation: Deep learning ensemble methodology for direct information contact prediction | ||
Recently, several new contact prediction methods have been published.
They (i) use large sets of multiply aligned sequences and (ii) assume that
correlations between columns in these alignments can be the result of
residue interactions, and are thus clues to residues' spatial proximity in the
native structure. These methods are clearly superior to earlier methods
when it comes to predicting contacts in proteins. PconsC [2] has been
developed by Marcin J. Skwark and combines predictions from two direct
information methods, PSICOV [4] and plmDCA [3], and two alignment
methods, HHblits and jackHmmer, at four different e-value thresholds,
obtaining an improvement of the predictive performances with respect to
the single methods on which it is based. The aim of this thesis project
was to further improve the quality of these predictions. To achieve this goal, I developed a deep learning architecture capable of performing structured predictions, taking into consideration the significant amount of information underlying the contact prediction problem instead of simply considering each residue pair independently of the others (in [1] it was shown that contacts in the native structure rarely involve a single isolated pair of residues). I implemented a multilayer learner using Random Forest classifiers that improves contact predictions by abstracting typical inter/multi-residue relationships among neighbouring residue pairs, namely by learning to recognize frequent visual patterns (mainly secondary structure features, such as alpha-helices and beta-sheets) in the contact maps. This abstraction ability can relocate the most uncertain predictions into the recognized patterns, reconstructing them and thus significantly improving the precision of the overall contact map. This deep learning approach, along with some additional features (e.g. predicted secondary structure and predicted relative solvent accessibility), can provide a further 20% improvement of PconsC's predictive performance. References: [1] Di Lena P, Nagata K and Baldi P, Deep architectures for protein contact map prediction, Bioinformatics 28(19):2449-2457 (2012). doi:10.1093/bioinformatics/bts475. [2] Skwark MJ, Abdel-Rehim A and Elofsson A, PconsC: Combination of direct information methods and alignments improves contact prediction, Bioinformatics (2013). doi:10.1093/bioinformatics/btt259. [3] Ekeberg M, Lövkvist C, Lan Y, Weigt M and Aurell E, Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models, Phys Rev E Stat Nonlin Soft Matter Phys 87(1):012707 (2013). [4] Jones D, Buchan D, Cozzetto D and Pontil M, PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments, Bioinformatics 28(2):184-190 (2012). | |||
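The core idea of re-scoring each residue pair in the context of its neighbours can be sketched as a simple feature extractor: for every cell of a predicted contact map, collect the surrounding window of scores and feed that vector to the next-layer classifier. This is an illustrative sketch only; the actual feature set and classifier configuration of the thesis are not reproduced here.

```python
def window_features(cmap, i, j, r=2):
    """Flatten the (2r+1) x (2r+1) neighbourhood of predicted contact
    scores around cell (i, j) of a square contact map.

    Out-of-range cells are padded with 0.0. In a multilayer learner of the
    kind described above, vectors like this would be the input to a
    second-layer classifier (e.g. a Random Forest) that re-scores each
    pair using the local pattern, so that alpha-helix and beta-sheet
    signatures in the map can be recognized and reinforced.
    """
    n = len(cmap)
    feats = []
    for di in range(-r, r + 1):
        for dj in range(-r, r + 1):
            a, b = i + di, j + dj
            feats.append(cmap[a][b] if 0 <= a < n and 0 <= b < n else 0.0)
    return feats

cmap = [[0.1, 0.2, 0.3],
        [0.4, 0.5, 0.6],
        [0.7, 0.8, 0.9]]
print(window_features(cmap, 1, 1, r=1))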
Thursday May 23 | 14:00 | Rune Linding | Technical University of Denmark, Lyngby, Denmark |
Lunch room at Scilifelab, floor 2 | Biological Forecasting and Cancer Kinome Networks | ||
Biological systems are composed of highly dynamic and interconnected molecular networks that drive
biological decision processes. The goal of network biology is to describe, quantify and predict the
information flow and functional behaviour of living systems in a formal language and with an accuracy
that parallels our characterisation of other physical systems such as Jumbo-jets. Decades of targeted
molecular and biological studies have led to numerous pathway models of developmental and disease-related
processes. However, so far no global models capable of predicting cellular trajectories in time,
space or disease have been derived from pathways. The development of high-throughput
methodologies has further enhanced our ability to obtain quantitative genomic, proteomic and phenotypic
readouts for many genes/proteins simultaneously. Here, I will discuss how it is now possible to derive
network models through computational integration of systematic, large-scale, high-dimensional
quantitative data sets. I will review our latest advances in methods for exploring phosphorylation
networks. In particular I will discuss how the combination of quantitative mass-spectrometry,
systems-genetics and computational algorithms (NetworKIN [1] and NetPhorest [4]) made it possible
for us to derive systems-level models of JNK and EphR signalling networks [2,3]. I shall discuss work
we have done in comparative phospho-proteomics and network evolution [5-7]. Finally, I will discuss our
most recent work in analysing genomic sequencing data from NGS studies and how we have developed new
powerful algorithms to predict the impact of disease mutations on cellular signaling networks [8,9]. References: http://www.lindinglab.org [1] Linding et al., Cell 2007. [2] Bakal et al., Science 2008. [3] Jorgensen et al., Science 2009. [4] Miller et al., Science Signaling 2008. [5] Tan et al., Science Signaling 2009. [6] Tan et al., Science 2009. [7] Tan et al., Science 2011. [8] Creixell et al., Nature Biotechnology Sep 2012. [9] Erler & Linding, Cell May 2012. | |||
Tuesday May 14 | 10:30 | Muhammad Owais Mahmudi | KTH, Royal Institute of Technology |
Lunch room at Scilifelab, floor 2 | Probabilistic genome wide reconciliation analysis across metazoans | ||
Gene duplication is considered to be a driving force of evolution that enables the genome of a
species to acquire new functions. A reconciliation, a mapping of gene tree vertices to the edges
or vertices of the species tree, explains where exactly on the species tree gene duplications occurred.
The Most parsimonious reconciliation (MPR) is the reconciliation that minimizes the number of
duplications. We present methods to sample reconciliations and compute most likely reconciliations
of gene and species trees. The reconciliations are sampled from a posterior over reconciliations, gene
trees, edge lengths along with other parameters given species tree and gene sequences. We employ a
Bayesian analysis tool, DLRS, based on a probabilistic model that integrates gene duplication, gene
loss and sequence evolution under a relaxed molecular clock for substitution rates, to obtain this posterior. We perform a genome-wide analysis of a nine-species dataset and conclude that for gene families with a higher rate of duplications, the most parsimonious reconciliation is not the correct explanation of the evolutionary history. For the given dataset, we observed that approximately 19% of the sampled reconciliations were not identical to the MPR, in contrast with previous estimates, where 98% of the reconciliations were observed to be identical to the MPR (Rasmussen et al. 2011). A heatmap is also generated for the sampled finer reconciliations, which map gene duplications to exact time points on the edges of the species tree, helping us understand the evolutionary history of genes and species. References: Rasmussen M.D., Kellis M. A Bayesian approach for fast and accurate gene tree reconstruction. Mol. Biol. Evol. 2011;28:273-290. | |||
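The MPR that the sampled reconciliations are compared against is defined by the classical LCA mapping: each gene-tree node maps to the lowest common ancestor in the species tree of the species below it, and a node is a duplication exactly when it maps to the same species-tree node as one of its children. A minimal sketch (the toy tree encodings here are hypothetical and not the DLRS input format):

```python
# Toy species tree ((A, B)AB, C)ABC encoded as child -> parent.
SPECIES_PARENT = {"A": "AB", "B": "AB", "AB": "ABC", "C": "ABC", "ABC": None}

def ancestors(s):
    """List of s and all its ancestors, leaf-to-root."""
    out = []
    while s is not None:
        out.append(s)
        s = SPECIES_PARENT[s]
    return out

def species_lca(x, y):
    """Lowest common ancestor of two species-tree nodes."""
    anc_x = ancestors(x)
    for a in ancestors(y):
        if a in anc_x:
            return a

def count_duplications(gene_tree):
    """gene_tree: a species name (leaf) or a (left, right) tuple.

    Returns (species-tree node the root maps to, number of duplications
    in the subtree) under the LCA mapping that defines the MPR.
    """
    if isinstance(gene_tree, str):
        return gene_tree, 0
    (l, dl), (r, dr) = (count_duplications(t) for t in gene_tree)
    m = species_lca(l, r)
    return m, dl + dr + (1 if m in (l, r) else 0)

# ((A, B), A): the inner node is a speciation mapping to AB; the root also
# maps to AB, matching its left child, so it is counted as one duplication.
print(count_duplications((("A", "B"), "A")))
```

The Bayesian analysis in the talk goes beyond this by sampling reconciliations from a posterior rather than committing to this single parsimony-optimal mapping.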
Thursday April 18 | 15:00 | Kristoffer Forslund | EMBL, Heidelberg |
Lunch room at Scilifelab, floor 2 | Country-specific antibiotic use practices impact the human gut resistome | ||
Despite increasing concerns over inappropriate use of antibiotics in medicine and food production, population-level resistance transfer into the human gut microbiota has not been demonstrated beyond individual case studies. To determine the “antibiotic resistance potential” for entire microbial communities, we employ metagenomic data and quantify the totality of known resistance genes in each community (its resistome) for 68 classes and subclasses of antibiotics. In 252 fecal metagenomes from three countries, we show that the most abundant resistance determinants are those for antibiotics also used in animals, and for antibiotics that have been available longer. Resistance genes are also more abundant in samples from Spain, Italy and France than from Denmark, the US, or Japan. Where comparable country-level data on antibiotic use in both humans and animals are available, differences in these statistics match the observed resistance potential differences. The results are robust over time as the antibiotic resistance determinants of individuals persist in the human gut flora for at least a year. | |||
Tuesday April 16 | 10:30 | Amin Saffari | KTH, Royal Institute of Technology |
Lunch room at Scilifelab, floor 2 | Unique peptide-level statistics through spectral clustering | ||
During the last few years, tandem mass spectrometry has played a major role in analyzing
and identifying protein mixtures. After digesting the protein into a mixture of peptides
and randomly fragmenting these peptides along their backbone, the masses of the resulting
peptide fragments are measured and matched against theoretically predicted spectra. The
resulting peptide-spectrum matches (PSMs) can be used to infer the peptides in the protein
mixture, and hence also the proteins. An interesting feature of current mass
spectrometers is that we run into several examples where multiple fragment spectra
arise from a single peptide species. This means that the error rates for unique peptides are
different from the error rates of the PSMs. As PSMs deriving from the same peptide are not probabilistically independent, it is hard to compensate for the redundant PSMs after we have derived the error rates of the individual PSMs. Instead we propose a scheme where we reformat the observed data, i.e. the spectra, before the final processing. We use so-called spectral clustering to combine the fragment spectra, hopefully resulting in spectra with clearer fragmentation patterns and hence higher matching scores between the peptide and the constructed spectrum. | |||
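The combining step can be illustrated with a toy version: bin each fragment spectrum into an m/z intensity vector, group spectra whose cosine similarity exceeds a threshold, and average each group into a consensus spectrum. The binning, the greedy single-pass grouping and the threshold are all illustrative choices, not the actual clustering algorithm of the talk.

```python
import math

def cosine(u, v):
    """Cosine similarity between two binned intensity vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_spectra(spectra, threshold=0.9):
    """Greedy single-pass clustering; returns one consensus per cluster.

    Each spectrum joins the first cluster whose representative it matches
    above the threshold; the consensus is the element-wise mean, which
    should reinforce shared fragment peaks and average out noise.
    """
    clusters = []  # each cluster: list of member spectra
    for s in spectra:
        for members in clusters:
            if cosine(s, members[0]) >= threshold:
                members.append(s)
                break
        else:
            clusters.append([s])
    return [[sum(col) / len(members) for col in zip(*members)]
            for members in clusters]

spectra = [[1.0, 0.0, 2.0], [0.9, 0.1, 2.1], [0.0, 3.0, 0.0]]
print(cluster_spectra(spectra))
```

Scoring the consensus spectrum once per cluster, instead of once per redundant PSM, is what restores (approximate) independence at the unique-peptide level.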
Tuesday April 9 | 11:00 | Simon Merid | Stockholm University |
Lunch room at Scilifelab, floor 2 | Gene network analysis to detect driver mutations in cancer | ||
In a cancer tumor, it is hard to tell which of the somatic mutations have driven the cancer's emergence and progression. Earlier approaches collected the most frequently observed mutations or considered consequences at the polypeptide chain level. Our framework detects driver mutations in individual tumors via functional network analysis. First, we benchmarked different versions of the global gene network by their ability to identify genes of known pathways and selected the best options. Then, actual sets of somatic mutations found in glioblastoma multiforme and ovarian serous cystadenocarcinoma were analyzed with the same test. Using this procedure, novel likely drivers were detected in a number of individuals. We compare the network analysis to earlier approaches. | |||
Tuesday April 9 | 10:30 | Ino deBruijn | Stockholm University |
Lunch room at Scilifelab, floor 2 | Benchmark of de novo Short Read Assembly Strategies for Metagenomics | ||
Metagenomics, the sequencing of environmental DNA,
has been demonstrated to be a promising approach for the
discovery and investigation of microbes that cannot be
cultured in the laboratory as well as for the study of
both free-living microbial communities and microbial
communities inside other organisms. In a typical shotgun
metagenomics experiment the DNA of a community is isolated
and high throughput sequencing is performed on a random sample
of the isolated DNA. The reads can either be analyzed as such,
by e.g. blast searches against reference databases to obtain a
functional profile of the microbial community, or they can be assembled
to form longer stretches of DNA stemming from the same or closely related
organisms, which can subsequently be analyzed with regard to phylogenetic affiliation and functional properties. In our studies, several strategies for the de novo assembly of metagenomes have been evaluated. Illumina short-read libraries have been shown in silico to work well on communities of medium complexity; we have therefore chosen to assess the assembly strategies for Illumina paired short reads specifically. Previous studies have mostly used in silico metagenomic data sets. In contrast, the community in our study is an in vitro simulated metagenome consisting of 59 species with completed or nearly completed genomes, so the quality of our assessment does not depend on the realism of read simulators. Even and uneven distributions of the 59 species were created in vitro. The community has also been sequenced with different types of library preparation, in order to test the effect of library preparation as well. The following assembly programs have been tested: Velvet, Meta-Velvet, Newbler, Minimus2, Bambus2 and Ray. The quality of the assemblies has been evaluated by mapping the constructed contigs or scaffolds to the collection of reference genomes. In addition, two pipelines have been constructed: one to perform the assemblies and another to perform the validation, given that a reference is available. | |||
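One standard contiguity statistic used when comparing assemblies like those benchmarked above is the N50. The abstract does not name the exact metrics used, so this is simply an illustrative sketch of the common definition:

```python
def n50(contig_lengths):
    """N50: the largest length L such that contigs of length >= L together
    cover at least half of the total assembly length.

    Computed by walking the contigs from longest to shortest and stopping
    once the accumulated length reaches half the total.
    """
    total = sum(contig_lengths)
    acc = 0
    for length in sorted(contig_lengths, reverse=True):
        acc += length
        if acc * 2 >= total:
            return length

print(n50([100, 80, 50, 30, 20]))  # total 280; 100 + 80 = 180 >= 140, so 80
```

Contiguity statistics alone can be misleading for metagenomes, which is why the validation pipeline described above also maps contigs back to the reference genomes to check correctness.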
Tuesday March 19 | 10:30 | Ramya Seetharaman | Stockholm University |
Lunch room at Scilifelab, floor 2 | Analysis of trans-membrane beta-barrel proteins: Insights into hydrophobicity, conservation and residue distribution | ||
Transmembrane beta-barrel proteins (TMBs) are found in the outer membranes of Gram-negative
bacteria, chloroplasts and mitochondria. They are known to perform a variety of essential functions
including transport, pore formation, voltage gating and drug efflux. TMBs are also
of immense biomedical importance as promising targets for vaccines and antimicrobial drugs.
Despite this, being membrane proteins, TMBs are not readily amenable to structure determination
procedures such as crystallography and NMR. Hence, not many nonhomologous structures of this
class of proteins are available. This necessitates predicting their structure and topology using
computational methods. Many computational structure prediction methods incorporate the specific physico-chemical properties of TMB residues, which in turn are gathered from already resolved structures. Among other properties, the hydrophobicities of residues found at different regions of the membrane, the extent to which different residues in various spatial locations are conserved, and the abundance of different residues in different regions of the biomembrane can be used effectively to improve the efficiency and accuracy of structure and topology prediction algorithms. Of special interest are the residues found in trimeric interfaces, as these residues may play a role in the oligomerization of monomers leading to trimer formation. In this study, an analysis of hydrophobicity, conservation and residue distribution in a dataset of both monomeric and trimeric TMBs is undertaken to clarify how these properties change across different regions of the protein, as well as with different classes of residues. The significant findings offer a possibility to better understand and predict the structure and topology of TMBs, and can be used to complement conventional structural studies of TMBs. | |||
Tuesday March 12 | 10:30 | Riccardo Vicedomini | University of Udine, Italy |
Lunch room at Scilifelab, floor 2 | GAM-NGS: Genomic Assemblies Merger for NGS | ||
In recent years, more than 20 assemblers have been proposed to tackle
the hard task of assembling NGS data. A common heuristic when assembling
a genome is to use several assemblers and then select the best assembly
according to some criteria. However, recent results clearly show that
some assemblers lead to better statistics than others on specific
regions but are outperformed on other regions or on different evaluation
measures. To limit these problems we developed GAM-NGS (Genomic Assemblies Merger for Next Generation Sequencing), whose primary goal is to merge two or more assemblies in order to enhance the contiguity and correctness of both. GAM-NGS does not rely on global alignment: regions of the two assemblies representing the same genomic locus (called blocks) are identified through read alignments and stored in a weighted graph. The merging phase is carried out with the help of this weighted graph, which allows an optimal resolution of locally problematic regions. GAM-NGS has been tested on six different datasets and compared to other assembly reconciliation tools. The availability of a reference sequence for three of them allowed us to show that GAM-NGS is able to output an improved, reliable set of sequences. GAM-NGS is also a very efficient tool, able to merge assemblies using substantially fewer computational resources than comparable tools. To achieve these goals, GAM-NGS avoids global alignment between contigs, making its strategy unique among assembly reconciliation tools. | |||
Tuesday January 15 | 10:30 | Viktoria Dorfer | University of Applied Sciences Upper Austria (FH OÖ), Hagenberg, Austria |
Lunch room at Scilifelab, floor 2 | Using MS Amanda for Identifying High Resolution and High Accuracy Tandem Mass Spectra | ||
MS Amanda is a new high-speed scoring system for identifying peptides from tandem mass spectrometry data using a database of known proteins. The algorithm is especially designed for high-resolution and high-accuracy tandem mass spectra. This work is a collaboration between the Protein Chemistry Lab at the Research Institute of Molecular Pathology (IMP), Vienna, and the Bioinformatics Research Group at the University of Applied Sciences Upper Austria (FH OÖ), Hagenberg. In this presentation we introduce MS Amanda and show results achieved on various datasets in comparison to Mascot and SEQUEST. |