SBC seminars 2012

Wednesday February 29, 10:30. Francesco Vezzi, postdoc in Lars Arvestad's group
Lunch room at SciLifeLab, floor 2. Evaluating De Novo Sequence Assembly

The whole-genome sequence assembly (WGSA) problem is among the most studied problems in computational biology. Despite the availability of a plethora of tools (i.e., assemblers), all claiming to have solved the WGSA problem, little has been done to systematically compare their accuracy and power. Traditional methods rely on standard metrics, read simulation, and/or the presence of a reference sequence. Recently, the Feature Response Curve (FRC) method was proposed to assess overall assembly quality and correctness: FRC transparently captures the trade-off between contig quality and contig size.

Despite its many advantages, FRC lacks a stand-alone implementation: it can be computed for only a relatively small number of assemblers, and it cannot handle the large amounts of data generated by High-Throughput Sequencers (HTS).

In this seminar we will discuss the main problems with available assembly validation techniques. We will focus on the Feature Response Curve, analyzing the correlation among its different features. Moreover, we will show the first results of a new tool able to compute the FRC without limitations on the assembler type or the assembled organism.
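As an illustration of the trade-off the FRC captures, the curve can be sketched as follows: for each feature threshold, report the genome fraction covered by the largest contigs whose cumulative count of suspicious features stays within the threshold. This is a simplified sketch assuming per-contig feature counts have already been computed; it is not the authors' implementation.

```python
def feature_response_curve(contigs, genome_size):
    """Approximate Feature Response Curve.

    contigs: list of (length, n_features) pairs, where n_features
    counts suspicious regions (assumed precomputed).
    Returns (feature_threshold, coverage_fraction) points: for each
    threshold phi, the genome fraction covered by the largest contigs
    whose cumulative feature count stays <= phi.
    """
    # Consider contigs from largest to smallest, as FRC does.
    ordered = sorted(contigs, key=lambda c: c[0], reverse=True)
    total_features = sum(f for _, f in ordered)
    curve = []
    for phi in range(total_features + 1):
        covered = feats = 0
        for length, f in ordered:
            if feats + f > phi:
                break
            feats += f
            covered += length
        curve.append((phi, covered / genome_size))
    return curve
```

For example, two contigs of 100 bp (one feature) and 50 bp (no features) against a 200 bp genome give the points (0, 0.0) and (1, 0.75).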


Vezzi F, Narzisi G, Mishra B (2012) Feature-by-Feature: Evaluating De Novo Sequence Assembly. PLoS ONE 7(2): e31002. doi:10.1371/journal.pone.0031002

Narzisi G, Mishra B (2011) Comparing De Novo Genome Assembly: The Long and Short of It. PLoS ONE 6: e19175.


Tuesday March 13, 10:30. Roland Nilsson, Computational Medicine
Lunch room at SciLifeLab, floor 2. Discovering gene function through large-scale analysis of gene expression data
As whole-genome expression analysis has become increasingly affordable, the amount of microarray data available in public repositories has increased rapidly in recent years. I will argue that published analyses have barely scratched the surface of the enormous information content in these data repositories, and that, given specific questions, systematic large-scale analysis can unearth new biology. We have previously exemplified this strategy with analyses that expand known co-expressed metabolic pathways to discover new factors in heme biosynthesis and in the regulation of oxidative phosphorylation. Recently, we have concentrated on optimizing the computational methods involved, using extensive pre-calculation to enable analyses of thousands of data sets on time scales of seconds, in order to make these techniques accessible to a wider audience. Ongoing developments include new methods for discovering the function of single genes without any prior knowledge, and for predicting causal genes within genetically associated loci.
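The co-expression strategy described above can be illustrated with a minimal sketch: rank all genes by Pearson correlation with a query gene across a set of arrays. The function names are hypothetical, and a production system would precompute the correlations, as the speaker describes.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles
    (assumes non-constant profiles)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = math.sqrt(sum((a - mx) ** 2 for a in x))
    vy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (vx * vy)

def coexpression_ranking(expr, query_gene):
    """Rank all other genes by co-expression with query_gene.
    expr: dict mapping gene -> list of expression values across arrays."""
    q = expr[query_gene]
    scores = {g: pearson(v, q) for g, v in expr.items() if g != query_gene}
    return sorted(scores, key=scores.get, reverse=True)
```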
Tuesday April 3, 10:00. Thomas Schmitt
Lunch room at SciLifeLab, floor 2. Orthology prediction and network inference
In my halftime seminar I will present our work in the fields of orthology prediction and the construction and analysis of genome-scale functional coupling networks. This includes the continuous improvements in the InParanoid database, the development of exchange formats for sequence and orthology information, link inference based on topological network properties, and our work on the functional coupling prediction framework FunCoup. Finally, I will bring everything together and show the importance of orthology predictions for network inference and the identification of conserved sub-networks.
Wednesday April 18, 10:30. Hossein Shahrabi Farahani
Lunch room at SciLifeLab, floor 2. Increased A-to-I editing of microRNAs during development
MicroRNAs are small non-coding RNAs that function as post-transcriptional regulators by binding to more or less complementary target sequences in mRNA. A microRNA typically has a large set of target mRNAs that it either degrades or suppresses the translation of. Adenosine-to-inosine (A-to-I) RNA editing is a co- or post-transcriptional processing event that converts adenosine to inosine within double-stranded RNA. Inosine is read as guanosine (G) by the cellular machinery. We use high-throughput RNA sequencing to determine editing levels in mature miRNA from the mouse transcriptome. A read with an A:G mismatch to a microRNA cannot a priori be categorized as an edited version of the microRNA. The read can, for example, be another known or even unknown microRNA whose sequence has only one A:G mismatch to the first microRNA. Sequencing errors add further complications of this type. We devised novel methods to address the issue of identifying such false positives. The main focus of this work is the rate of A-to-I editing during development. For the first time, we show here that the level of editing increases with development, thereby indicating a regulatory role for editing during brain maturation.
Wednesday May 2, 10:30. Joel Sjostrand
Lunch room at SciLifeLab, floor 2. Reconciling gene trees and species trees
Over the last decade, the Bayesian approach has gained in popularity in phylogenetics. One reason for this is the possibilities it provides for creating more realistic and complex models of evolution. Of particular interest in recent years has been the interplay between a homologous gene family and the corresponding species tree. Although these are highly intertwined, key evolutionary mechanisms such as gene duplication, loss, and horizontal transfer will effectively create discordances that may be difficult to resolve. In this talk, I will discuss some of our current methods for simultaneously inferring and reconciling a gene tree with a species tree, and present some related results from eubacteria, invertebrates, and vertebrates.
Wednesday May 16, 10:30. Per Kraulis
Lunch room at SciLifeLab, floor 2. Web Services at SciLifeLab
Web Services (WS) is an established mode of distributed computing, which is used extensively in bioinformatics and related areas. I will discuss the main architectural principles for WS that have emerged during the last decade, with a focus on so-called RESTful Web Services. I will give some examples of how this is relevant to the current and future activities at SciLifeLab. In our setting, integration of different datasets and computational approaches is a very important challenge. How can we leverage Web Services to integrate and publish our resources in the best way? I will highlight the technological aspects as well as the policy implications for SciLifeLab.
Friday May 25, 10:30. Martin Blom
Lunch room at SciLifeLab, floor 2. Prioritizing candidate disease genes by using prior information
Network "guilt by association" (GBA) is a proven approach for identifying novel disease genes, based on the observation that similar mutational phenotypes arise from functionally related genes. However, classical GBA is not an ideal fit for genome-wide association studies (GWAS), where many genes are somewhat implicated but few are known with very high certainty. I resolve this by explicitly modeling the uncertainty of the associations and incorporating the uncertainty of the seed set into the GBA framework. I will also talk briefly about phenotypes in model species as a source of prior information independent of gene interactions, and show how they can be combined with functional gene information to find candidate disease genes.
Tuesday May 29, 10:30. Adina Howe, post-doctoral researcher in Titus Brown's lab at Michigan State University
Lunch room at SciLifeLab, floor 2.
The development of next-generation short-read sequencing technologies has allowed us to sequence soil microbial communities to unprecedented depths. We now have extremely large soil metagenomes which, because of their size and short read lengths, cannot be analyzed with traditional genomic tools. A de novo metagenomic assembly approach significantly reduces the size of the data for analysis and does not rely on the availability of reference genomes. However, de novo assembly is challenged by extremely high sequence diversity, uneven sequencing coverage, sequencing errors and biases, and the need for large computational resources. We have developed novel approaches to enable soil metagenome assemblies through the removal of sequencing errors and biases, data reduction, and scalable assembly graph representations. We initially reduce the size of the dataset by normalizing the average coverage of the soil metagenome using an approach we term “digital normalization.” We eliminate redundant, high-coverage short reads within a dataset using a single-pass, constant-memory algorithm. This normalization reduces the metagenome dataset size and yields a more even distribution of read coverage. Comparing assemblies before and after digital normalization for an E. coli genome (50x coverage normalized to 5x), we found that assemblies were > 99% similar despite eliminating 90% of the reads. Similar results were observed for a subset of a soil metagenome. Next, we were able to partition several metagenomic datasets into millions of disconnected assembly subgraphs using a probabilistic data structure based on Bloom filters. Among these subgraphs, we consistently found a single, dominant partition consisting of 5 to 76% of metagenomic reads. Characterizing the sequences and connectivity within this dominant partition, we identified position-specific biases within sequenced reads, suggesting the presence of spurious connectivity within metagenomes.
Using a systematic traversal algorithm, we could identify and remove highly connecting sequences from this partition and subsequently repartition the remaining sequences. We found that the filtering of these sequences not only removes potential sequencing artifacts but also improves assemblies (as demonstrated in simulated datasets) and breaks apart the largest partition, allowing for scalable assembly. Applying this partitioning approach to a soil metagenome (30 million reads), we decreased assembly memory requirements by 8-fold. In conclusion, we have developed approaches that can be applied to the assembly of the growing amounts of soil metagenomic sequencing data. Our approach results in numerous smaller datasets which can be analyzed and/or assembled independently (with separate parameters) and in parallel, and subsequently combined into a final assembly for a metagenome. Furthermore, many of our methods can be extended and applied to other sequence analyses (e.g., transcriptomes).
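The digital normalization step described above can be sketched in a few lines: stream through the reads, estimate each read's coverage as the median count of its k-mers seen so far, and keep the read only if that median is below a cutoff. This illustrative version uses an exact dictionary; the published method uses a constant-memory probabilistic counting structure, and the parameter values here are arbitrary.

```python
from collections import defaultdict

def digital_normalization(reads, k=4, cutoff=5):
    """Single-pass sketch of digital normalization.

    A read is kept only if the median count of its k-mers among
    previously kept reads is below `cutoff`; kept reads then update
    the k-mer counts. High-coverage redundant reads are discarded.
    """
    counts = defaultdict(int)
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        meds = sorted(counts[km] for km in kmers)
        median = meds[len(meds) // 2]
        if median < cutoff:
            kept.append(read)
            for km in kmers:
                counts[km] += 1
    return kept
```

Feeding ten identical reads through this sketch keeps only the first few, until the k-mer counts reach the cutoff.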
Monday August 27, 10:30. Ashley Teufel (PhD student of David Liberles), University of Wyoming
Lunch room at SciLifeLab, floor 2. Models of gene fate post duplication
Duplicated genes provide a multitude of potential paths for evolution to act upon and as such are a major source of evolutionary novelty. Differential loss rates of genes after a duplication event can give insight into the ultimate fate of a gene. Several processes have been described which lead to duplicate gene retention or loss after both smaller-scale and whole-genome duplication events, including neofunctionalization, subfunctionalization, nonfunctionalization and dosage balance. Previous work has characterized expectations of duplicate gene retention under the neofunctionalization and subfunctionalization models. We have extended this work to include dosage balance and introduce a generalized survival model. This model was constructed based on the distinct retention/loss patterns of the different mechanisms of gene retention, and verified via simulations employing a network-based model of gene loss among the interacting and duplicated partners. The survival model distinguishes gene fate based on loss rate by employing a time-heterogeneous hazard function. Using this model, a simulated gene tree can be constructed within a given species tree, for which dN/dS values consistent with the state of redundancy and selection are employed. The dS value acts as a proxy for time, and the simulation increments across this timescale, allowing both gene birth and death events to occur at each increment. Using INDELible (Fletcher and Yang, 2009), sequences are then simulated across this gene tree using branch-dependent substitution models corresponding to each gene's mechanism of retention.
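The time-heterogeneous hazard idea can be made concrete with a small numerical sketch: the retention (survival) probability is S(t) = exp(-∫₀ᵗ h(u) du), and different retention mechanisms correspond to different shapes of h. The hazard functions below are illustrative stand-ins, not the fitted forms from this work.

```python
import math

def survival(hazard, t, dt=0.001):
    """Retention probability S(t) = exp(-integral of h(u) from 0 to t),
    computed by a simple left Riemann sum over the hazard function."""
    steps = int(t / dt)
    integral = sum(hazard(i * dt) * dt for i in range(steps))
    return math.exp(-integral)

# Illustrative hazard shapes (assumptions, not the paper's models):
constant = lambda u: 1.0                # loss pressure never relaxes
decaying = lambda u: math.exp(-5 * u)   # preservation raises retention
```

A decaying hazard, as expected under mechanisms that preserve duplicates, yields a higher retention probability at any fixed time than a constant one.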
Tuesday September 4, 10:30. Kristoffer Sahlin
Lunch room at SciLifeLab, floor 2. Distance estimation of unknown sequence in genome assembly and structural variation detection
Motivation: One of the important steps of genome assembly is scaffolding, in which contigs are linked using information from read pairs. Scaffolding provides estimates of the order, relative orientation and distance between contigs. Current models provide contig distance estimates that are generally strongly biased and based on false assumptions. Another scenario where distance estimates from paired reads are important is structural variation detection: the paired-read distances are used both to indicate the occurrence of an indel and to estimate the size of this structural variant. Since erroneous distance estimates can mislead subsequent analysis, it is important to provide unbiased estimates of contig distance. Results: We show that state-of-the-art programs for scaffolding use an incorrect model of distance estimation. We discuss why current maximum likelihood estimators are biased and describe the different cases of bias we are facing. Furthermore, we provide a model for the distribution of reads that span a gap between two contigs, and derive the maximum likelihood equation for the gap length. We motivate why this estimate is sound and show empirically that it outperforms the gap estimators in popular scaffolding programs. Our results have consequences for scaffolding software, structural variation detection, and library insert-size estimation as commonly performed by read aligners.
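The flavor of a bias-corrected estimator can be conveyed with a toy grid search: condition each observed insert on being observable at all, i.e. long enough to span the gap plus both reads, and short enough for both mates to land on the contigs. This is a simplified sketch of such a conditional likelihood, not the closed-form estimator derived in the work.

```python
import math

def _phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def ml_gap(observed, mu, sigma, l1, l2, read_len):
    """Grid-search ML gap estimate between two contigs.

    observed: per spanning pair, the summed distance of the two reads
    to the contig ends (i.e. insert size minus the unknown gap).
    Library insert sizes are assumed N(mu, sigma); each observation is
    conditioned on the insert being observable given contig lengths
    l1, l2 and the read length (simplified truncation model).
    """
    best_g, best_ll = 0, float("-inf")
    for g in range(0, int(mu)):
        lo = (g + 2 * read_len - mu) / sigma   # must span gap + reads
        hi = (g + l1 + l2 - mu) / sigma        # must fit on contigs
        norm = _phi(hi) - _phi(lo)
        if norm <= 0:
            continue
        ll = sum(-0.5 * ((o + g - mu) / sigma) ** 2 for o in observed)
        ll -= len(observed) * math.log(norm)   # truncation correction
        if ll > best_ll:
            best_g, best_ll = g, ll
    return best_g
```

With long contigs and short reads the truncation term vanishes and the estimate reduces to the naive mu minus the mean observed span; with short contigs, the correction term shifts the estimate.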
Tuesday September 11, 14:00. Lukasz Huminiecki
Lunch room at SciLifeLab, floor 2. 2R and remodeling of the vertebrate signal transduction engine

BACKGROUND: Whole genome duplication (WGD) is a special case of gene duplication, observed rarely in animals, whereby all genes duplicate simultaneously through polyploidisation. Two rounds of WGD (2R-WGD) occurred at the base of vertebrates, giving rise to an enormous wave of genetic novelty, but a systematic analysis of functional consequences of this event has not yet been performed.

RESULTS: We show that 2R-WGD affected an overwhelming majority (74%) of signalling genes, in particular developmental pathways involving receptor tyrosine kinases, Wnt and transforming growth factor-β ligands, G protein-coupled receptors and the apoptosis pathway. 2R-retained genes, in contrast to tandem duplicates, were enriched in protein interaction domains and multifunctional signalling modules of Ras and mitogen-activated protein kinase cascades. 2R-WGD had a fundamental impact on the cell-cycle machinery, redefined molecular building blocks of the neuronal synapse, and was formative for vertebrate brains. We investigated 2R-associated nodes in the context of the human signalling network, as well as in an inferred ancestral pre-2R (AP2R) network, and found that hubs (particularly involving negative regulation) were preferentially retained, with high connectivity driving retention. Finally, microarrays and proteomics demonstrated a trend for gradual paralog expression divergence independent of the duplication mechanism, but inferred ancestral expression states suggested preferential subfunctionalisation among 2R-ohnologs (2ROs).

CONCLUSIONS: The 2R event left an indelible imprint on vertebrate signalling and the cell cycle. We show that 2R-WGD preferentially retained genes are associated with higher organismal complexity (for example, locomotion, nervous system, morphogenesis), while genes associated with basic cellular functions (for example, translation, replication, splicing, recombination; with the notable exception of cell cycle) tended to be excluded. 2R-WGD set the stage for the emergence of key vertebrate functional novelties (such as complex brains, circulatory system, heart, bone, cartilage, musculature and adipose tissue). A full explanation of the impact of 2R on evolution, function and the flow of information in vertebrate signalling networks is likely to have practical consequences for regenerative medicine, stem cell therapies and cancer treatment.

Tuesday September 25, 10:30. Sara Light
Lunch room at SciLifeLab, floor 2. The evolution of protein domain repeats
Protein domain repeats are evolutionarily related units that occur in tandem within a protein. These are stretches of domains from the same family, situated next to each other in a protein. Certain properties characterize these domains. First, they are often quite short, often less than 50 residues, and, second, they tend to be highly variable with only a few residues that are crucial for the functionality of the domain. Structurally, repeat domains are diverse and may form modular structures on their own or form larger filaments where each repeat is dependent on other repeats for its functionality. Their sequences are malleable, both with regard to the repeating unit and in the number of repeats, and they therefore provide flexible binding to many partners. We have studied two actin-binding proteins, namely the muscle protein Nebulin and the cytoskeleton crosslinker Filamin. These two proteins constitute two very different examples of protein domain repeat expansions. The former evolves through additions of seven domains at a time, while the latter has undergone expansions of variable size in some invertebrate lineages. Further, while repeat domains are fairly uncommon among prokaryotes, we have characterized their abundance in these organisms, finding that some species have an overabundance of repeating domains. Finally, when we classify all repeat domains according to events that may be detected through pairwise alignments of domains, we find that among the domains that are common in long repeat proteins, sushi and spectrin domains evolve primarily through cassette tandem duplications while scavenger and immunoglobulin repeats appear to evolve through clustered tandem duplications. Additionally, immunoglobulin and filamin repeats exhibit a unique pattern where roughly every other domain shows high sequence similarity. 
This pattern may be the result of tandem duplications, may serve to avert aggregation between adjacent domains, or may reflect functional constraints.
Tuesday October 2, 10:30. Nanjiang Shu
Lunch room at SciLifeLab, floor 2. Topology variation in membrane protein families
The topology of integral membrane proteins is generally considered to be conserved within a protein family, following the concept that structure is more conserved than sequence. However, recent studies show that topology variations such as internal gene duplications, insertions/deletions of transmembrane (TM) helices and even inverted topologies are not extremely rare. Deep analysis of these topology variations is lacking due to the limited number of solved membrane protein structures. By taking advantage of accurate membrane topology predictors, we attempt to address the frequency and nature of these changes in more detail. Using the topologies predicted by twelve different topology predictors, we compared 150,000 pairs of homologous membrane proteins from 1055 Pfam clans/families. We show that the fraction of pairs with variations decreases as sequence identity increases. In total, about 6% of all pairs are predicted to have inverted topology, indicating that dual topology may exist extensively in many protein families. Moreover, TM helices aligned to gaps (TM2GAP) are more frequent at the N- and C-termini than in the middle of the sequence. In contrast, TM helices aligned to non-TM regions (TM2SEQ) occur more often in the middle of the sequence. Furthermore, among TM helices aligned to gaps, an even number of helices is more frequent than an odd number. This may indicate that evolution prefers that the overall orientation of the membrane protein not change, since inserting or deleting an even number of TM helices keeps the orientation of the rest of the membrane protein in the cell membrane unchanged.
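A minimal sketch of the pairwise comparison: extract the TM segments from two predicted topology strings and classify the pair by helix count and N-terminal orientation. The labels and categories here are illustrative simplifications of the TM2GAP/TM2SEQ analysis, not the actual pipeline.

```python
def tm_segments(topo):
    """Extract (start, end) spans of TM helices from a topology string
    like 'iiMMMMoo' (i = inside, o = outside, M = membrane)."""
    segs, start = [], None
    for pos, c in enumerate(topo + " "):   # sentinel closes a final helix
        if c == "M" and start is None:
            start = pos
        elif c != "M" and start is not None:
            segs.append((start, pos - 1))
            start = None
    return segs

def compare_topologies(t1, t2):
    """Classify a pair of predicted topologies: SAME, INVERTED (same
    helix count, flipped in/out orientation), or DIFF (different helix
    counts, e.g. a TM helix aligned to a gap or loop)."""
    if len(tm_segments(t1)) != len(tm_segments(t2)):
        return "DIFF"
    first1 = next((c for c in t1 if c in "io"), "?")
    first2 = next((c for c in t2 if c in "io"), "?")
    return "SAME" if first1 == first2 else "INVERTED"
```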
Tuesday October 9, 10:30. Dan Larhammar
Lunch room at SciLifeLab, floor 2. Evolution of vertebrate gene families - the impact of genome doublings
The evolution of gene families is usually investigated by sequence comparisons and construction of phylogenetic trees. For gene families that underwent many duplications in a short period of time, it is often difficult to deduce the order of the duplications. This may be due to uneven evolutionary rates or differential losses of duplicates in different evolutionary lineages. An additional approach has turned out to be useful: comparison of chromosomal locations for genes. Comparisons of synteny between species can distinguish species homology (orthology) from gene duplications (paralogy). During the past several years, it has become clear that massive gene duplications took place as a result of two genome doublings (tetraploidizations) in the vertebrate ancestor over 500 million years ago. One more tetraploidization took place in the ancestor of true bony fishes some 300 million years ago, and several additional tetraploidizations have happened in various vertebrate lineages. This explains why goldfish and salmon have many more genes than humans. By combining sequence comparisons with chromosome comparisons, we have been able to deduce the evolutionary history of numerous problematic gene families, including neuropeptides, receptors (such as the opiate/endorphin receptors), growth hormone and ion channels.
Tuesday October 16, 10:30. Alexandru Ioan Tomescu
Lunch room at SciLifeLab, floor 2. Polynomial Time Algorithms for Estimating Transcript Expression with RNA-Seq on Gene Graphs with Some Bounded Parameters
Recent RNA-Seq technology allows for new high-throughput ways of isoform identification and quantification, and various methods have been put forward for this non-trivial problem. In this talk, we propose to interpret the problem as finding the paths which best explain, under a least-squares model, the coverages in an exon chaining graph. Aligning RNA-sequencing reads to the genome results in coverage values for exons and plausible splice variants. These coverage values can be assigned as weights in an exon chaining graph G = (V, E), where the nodes V are exons and the edges E are the splice variants. An RNA transcript candidate is a path from an exon (node) s in V containing a start codon to an exon (node) t in V containing a stop codon. We study the problem of finding k transcripts (paths) from s to t, each associated with an expression level, such that together they best explain the coverages (weights) of the exons (nodes) and splice variants (edges). We give a dynamic programming algorithm to find the best paths and associated expression levels; the algorithm runs in polynomial time assuming constant bounds on k, on the maximum degree in G, and on the expression levels. We also show that the problem is NP-hard in general. Experimental results show that our method is very competitive, as it provides better precision and recall under stringent conditions on prediction accuracy than popular tools such as Cufflinks and IsoLasso.
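For very small instances, the least-squares formulation can be solved by brute force, which conveys the model even though it ignores the dynamic programming that makes the real algorithm polynomial: enumerate s-t paths, pick k of them and per-path expression levels from a finite grid (as in the bounded-parameter setting), and minimize the squared deviation from the observed node coverages. Edge coverages are omitted here for brevity; all names are illustrative.

```python
from itertools import combinations, product

def all_paths(graph, s, t, path=None):
    """All s-t paths in a DAG given as {node: [successors]}."""
    path = (path or []) + [s]
    if s == t:
        return [path]
    out = []
    for nxt in graph.get(s, []):
        out.extend(all_paths(graph, nxt, t, path))
    return out

def best_transcripts(graph, node_cov, s, t, k, levels):
    """Choose k s-t paths and per-path expression levels from `levels`
    so the summed levels over each node best match node_cov
    (least squares over nodes only)."""
    paths = all_paths(graph, s, t)
    best = (float("inf"), None)
    for combo in combinations(paths, k):
        for expr in product(levels, repeat=k):
            pred = {v: 0 for v in node_cov}
            for p, e in zip(combo, expr):
                for v in p:
                    pred[v] += e
            err = sum((node_cov[v] - pred[v]) ** 2 for v in node_cov)
            if err < best[0]:
                best = (err, (combo, expr))
    return best[1]
```

On a two-isoform toy gene with coverages consistent with levels 2 and 1, the brute force recovers both paths with those levels exactly.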
Tuesday October 23, 10:30. Yoshinori Fukasawa
Lunch room at SciLifeLab, floor 2. MoiraiSP: a novel mitochondrial targeting signal predictor and its features
An estimated 1000-1500 different proteins localize to mitochondria; however, numerous mitochondrial proteins remain undiscovered. Prediction of mitochondrial targeting signals is an efficient approach to identifying undiscovered mitochondrial proteins. A cleavable N-terminal presequence is the best-characterized mitochondrial targeting signal; about half of known mitochondrial proteins possess a presequence. Mitochondrial proteins with a presequence are imported into mitochondria via the translocase, and the presequence is then cleaved off by the mitochondrial processing peptidase (MPP) and intermediate proteases in the matrix. However, the detailed mechanisms remain unclear. Moreover, the data on experimentally identified presequences have been limited, and current predictors therefore do not achieve sufficient performance in presequence and cleavage-site prediction. Fortunately, large-scale proteomic analyses of presequences were recently performed in yeast and plants, and these proteomic data are useful for improving prediction. In this work, we therefore developed a predictor for presequences and their cleavage sites, trained on recent proteomic data as well as amino acid composition, physico-chemical properties and an import-receptor recognition motif. We furthermore performed a novel motif search, generated profiles for the cleavage sites of MPP and the intermediate proteases, and integrated them into our prediction. Our predictor attains better performance than present predictors, with an especially significant improvement in cleavage-site prediction: our novel predictor, MoiraiSP, correctly predicts about 71% of canonical cleavage sites, compared to about 54% for TargetP, a standard tool in this field. These results indicate that, with the advantage of a large training dataset for cleavage sites, MoiraiSP makes more accurate predictions than previous methods.
Thus our method is valuable for finding candidates for undiscovered mitochondrial proteins and their signal regions.
Tuesday October 30, 10:30. Francesco Vezzi
Lunch room at SciLifeLab, floor 2. ERNE-BS5: aligning BS-treated sequences by multiple hits on a 5-letters alphabet
Cytosine methylation is a DNA modification that has a great impact on the regulation of gene expression and important implications for the biology and health of several living beings, including humans. Bisulfite conversion followed by next-generation sequencing (BS-seq) of DNA is the gold-standard technique used to detect DNA methylation at single-base resolution on a genome scale, through the identification of 5-methylcytosine (5-mC). However, by converting unmethylated cytosines into thymines, BS-seq poses computational challenges to read alignment and aggravates the issue of multiple hits due to the ambiguity raised by the reduced sequence complexity. In this seminar we present ERNE-BS5 (Extended Randomized Numerical alignEr - BiSulfite 5), an alignment program developed to efficiently map BS-treated reads against large genomes (e.g., human). To achieve this goal we have implemented three different ideas: (i) we use a 5-letter alphabet for storing methylation information, (ii) we use a weighted context-aware Hamming distance to identify a T coming from an unmethylated C context, and (iii) we use an iterative process to position multiple-hit reads starting from a preliminary map built using single-hit alignments. The map is corrected and extended at each cycle using the alignments added in the previous iteration. ERNE-BS5 is based on a new, improved version of the rNA alignment software with a more efficient core.
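Idea (ii) can be illustrated with a toy weighted Hamming distance in which a read T aligned over a reference C, the signature of an unmethylated converted cytosine, receives a small penalty instead of a full mismatch. The weights are invented for illustration and are not ERNE-BS5's actual parameters (which are also context-aware).

```python
def bs_distance(read, ref, mismatch=1.0, bs_weight=0.1):
    """Bisulfite-aware weighted Hamming distance (illustrative weights).

    A read T over a reference C is a plausible unmethylated-C
    conversion and gets a small weight; any other difference costs
    a full mismatch penalty."""
    assert len(read) == len(ref)
    d = 0.0
    for r, g in zip(read, ref):
        if r == g:
            continue
        elif r == "T" and g == "C":
            d += bs_weight    # likely bisulfite conversion
        else:
            d += mismatch
    return d
```

Under this scoring, a bisulfite-converted read stays close to its true locus while genuinely mismatching loci are penalized in full.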
Tuesday November 6, 10:30. Hossein Farahani
Lunch room at SciLifeLab, floor 2. Learning cancer progression networks with graphical models
Cancer is a result of the accumulation of different types of genetic mutations, such as copy number aberrations. The data from tumors are cross-sectional and do not contain the temporal order of the genetic events. Finding the order in which the genetic events occurred, and the progression pathways, is of vital importance in understanding the disease. To model cancer progression, we propose Progression Networks (PNs), a special case of Bayesian networks (BNs) tailored to model disease progression. We also describe algorithms for learning Bayesian networks in general and progression networks in particular. We reduce the hard problem of learning Bayesian and progression networks to Mixed Integer Linear Programming (MILP), an NP-complete problem for which very good heuristics exist. We introduce three algorithms. The first learns PNs from complete data with a bounded number of parents for each node in the learned PN. In a PN, the time to perform inference grows exponentially with the tree-width of the network; as a result, an algorithm for learning PNs with bounded tree-width is necessary if we intend to perform inference over the learned PN. The second algorithm learns such PNs. There are always experimental errors involved in discovering the aberrations; our third algorithm is a global structural EM algorithm for learning PNs from incomplete data to deal with such errors. The performance of the algorithms is tested on synthetic data and on real cytogenetic data from renal cell carcinoma.
Tuesday November 20, 10:30. David Drew
Lunch room at SciLifeLab, floor 2. Revealing the secrets of ion-coupled transport
Transmembrane gradients are harnessed by secondary transporters to drive the uptake of ions and molecules into the cell. Secondary transporters are found in every species from all kingdoms of life. In humans they carry out diverse functions such as the intestinal absorption of peptides, cholesterol and sugars, and they also transport neurotransmitters into synaptic vesicles. Secondary transporters are the targets of many therapeutics, such as serotonin re-uptake inhibitors (antidepressants), and they often play a major role in drug pharmacokinetics. Understanding the mechanisms by which secondary transporters shuttle ions, drugs, and natural compounds across membranes is of fundamental importance. Because of the technical difficulties in working with membrane proteins, our structural understanding is still limited. Here, I will present the latest structural insights, biochemistry and MD simulations on the transport mechanism for members harbouring the NhaA fold, a family we know very little about. I will also present new methods for improving the likelihood of obtaining membrane protein structures.
Tuesday November 27, 10:30. Bjorn Sponberg and Ganapathi Varma
Lunch room at SciLifeLab, floor 2. Sponberg: A new model argues against just-in-time synthesis in M phase
Sponberg abstract: Clarifying as many details as possible of the human cell cycle is of great medical interest and part of the initiative to eliminate cancer. A new regulatory model for the eukaryotic cell cycle has been launched. The new model builds on three principal evolutionary arguments which, combined, lead to the alternative regulatory solution it suggests. More specifically, the model hypothesizes that as the chromosomes enter their stressful states in M phase, transcripts will be supplied from P-body vesicles instead of from just-in-time synthesis. For this reason, these special mRNAs are referred to as P-body transcripts. Consequently, the P-body transcripts must be synthesized and transported into P-bodies before M phase in the cell cycle. According to the new model, this happens just after the restriction point in G1 phase. 460 genes from the human cell cycle were downloaded into a list and ranked phase-chronologically, from G1 to M phase. By joining the beginning of the list (G1-phase genes) with its end (M-phase genes), a turntable sequence-wheel representing the human cell cycle was created. By gradually turning it around, the wheel became a tool to systematize the hunt for the hypothesized P-body genes. Three analytic tools were used in the search: SVMlight, Blastall and UTRscan. The final results supported: 1) the possible existence of the hypothesized P-body genes; 2) their suspected placement in the cell cycle, as predicted by the new model; 3) that the P-body genes are probably temporally regulated via three gene regions, the three prime untranslated region (3'UTR), the five prime untranslated region (5'UTR) and introns; and 4) that the Musashi binding element (MBE) and upstream open reading frame (uORF) are potential regulatory candidates in the P-body transcripts, which could control their unique temporal storage and release via cytosolic P-bodies.
In conclusion, these results can lead to further investigations in the wet-lab to clarify if, and how, the new model functions in vivo.
Ganapathi: Benchmarking the next generation of homology inference tools
Ganapathi abstract Over the past decade, the number of available genome and proteome sequences has grown enormously, accumulating in databases. Searching query sequences against these databases with bioinformatics tools provides an alternative to wet-lab analysis. Recently, advanced techniques have been employed to allow search tools to detect remote homologies and to infer functional features of sequences. In this work, we tested the variation in performance among these methods, since large-scale benchmarks applicable to all search tools are still not readily available to biologists. We evaluated five widely used next-generation homology inference tools: PSI-BLAST, CS-BLAST, USEARCH, HHSEARCH and PHMMER. For comparison, the traditional NCBI-BLAST and FASTA search tools were included. We generated challenging benchmark datasets based on three different domain databases, PFAM, SCOP/Superfamily and CATH/Gene3D, restricted to twenty-nine evolutionarily diverged species. Using homology as the criterion, we also tested the variation among the protein domain classification databases. For each benchmark dataset, all-against-all pairwise comparisons were performed, and ROC curves were computed from the obtained e-values. The results reveal that PHMMER significantly outperformed the other methods, with high AUC scores. We also measured the performance of all methods at an e-value cutoff of 10^-3; this analysis indicates that PHMMER yields the highest mean precision (99.9%) over the three datasets. Furthermore, to verify the differences we conducted two-way ANOVA tests and pairwise t-tests, which showed statistically significant differences in performance when varying methods and databases. Overall, these advanced improvements in PHMMER enhance its ability to infer potential homologs and increase precision without reducing computation speed.
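The evaluation step described above, ranking all-against-all pairs by e-value and scoring them against domain-sharing labels, can be sketched as follows. This is an illustrative stand-in (not the benchmark's actual code), with invented toy data; the AUC is computed via the Mann-Whitney interpretation of ROC area.

```python
# Sketch: ROC AUC from pairwise search e-values. A pair is a positive
# if the two sequences share a domain annotation; smaller e-values
# should rank positives above negatives. All data below are invented.

def roc_auc(evalues, labels):
    """AUC via the Mann-Whitney U statistic: the probability that a
    random positive pair receives a smaller e-value than a random
    negative pair (ties count as half)."""
    pos = [e for e, l in zip(evalues, labels) if l == 1]
    neg = [e for e, l in zip(evalues, labels) if l == 0]
    wins = sum((p < n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy all-against-all results: (e-value, is_true_homolog)
pairs = [(1e-50, 1), (1e-20, 1), (2.0, 1), (1e-4, 0), (0.5, 0), (8.0, 0)]
evals, labels = zip(*pairs)
print(roc_auc(evals, labels))
```

Precision at a fixed cutoff (e.g. 10^-3, as in the abstract) would instead count, among pairs with e-value below the cutoff, the fraction that are true homologs.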
Monday December 03 10:30 Megan Owen
Lunch room at SciLifeLab floor 2 Statistics in the Space of Phylogenetic Trees
Short Bio Megan Owen is a researcher at the David R. Cheriton School of Computer Science at the University of Waterloo in Canada. Her current research focuses on using geometric spaces to represent, study, and analyze phylogenetic trees and networks, as well as tree-shaped data arising from medical imaging. She is also interested in mathematical and statistical problems related to understanding these spaces.
We introduce new notions of mean and variance for a set or distribution of phylogenetic trees. These definitions of mean and variance are analogous to those for a weighted set of points in Euclidean space, but with the underlying space being the space of phylogenetic trees constructed by Billera, Holmes, and Vogtmann (2001). A property of this space (non-positive curvature) ensures there is a unique shortest path between any two trees. Furthermore, this path can be computed in polynomial time, leading to a practical algorithm for computing the mean and variance. I will compare the mean and variance to existing consensus tree and summary methods, as well as present applications to such biological problems as the reconstruction of phylogenetic trees and the automatic labelling of lung airway scans.
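One way to picture the mean described above: the Fréchet mean minimizes the sum of squared geodesic distances to the input trees, and Sturm's inductive algorithm approximates it by stepping a fraction 1/(k+1) along the geodesic from the current estimate to the next sample. The sketch below is not the authors' implementation; it uses Euclidean points as a stand-in (where the geodesic is a straight line and the procedure recovers the ordinary mean), whereas in tree space the geodesic would come from the polynomial-time path algorithm.

```python
# Illustrative sketch of Sturm's inductive Frechet-mean algorithm,
# run in Euclidean space as a stand-in for tree space. In a CAT(0)
# space (such as Billera-Holmes-Vogtmann tree space) the same update
# along unique geodesics converges to the Frechet mean.

def frechet_mean(points):
    mean = list(points[0])
    for k, x in enumerate(points[1:], start=1):
        t = 1.0 / (k + 1)  # step fraction toward x along the "geodesic"
        mean = [m + t * (xi - m) for m, xi in zip(mean, x)]
    return mean

def frechet_variance(points, mean):
    """Mean squared distance from the points to the mean."""
    return sum(sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
               for x in points) / len(points)

pts = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
m = frechet_mean(pts)
print(m, frechet_variance(pts, m))
```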
Tuesday December 04 10:30 Emil Kolbeck
Lunch room at SciLifeLab floor 2 Protein domain versatility scoring methods
Protein domains are modules of conserved protein structure that are used in many types of studies in evolutionary proteomics and neighboring fields. The Pfam database contains a large set of protein domains constructed using hidden Markov models. There have been several attempts to define a metric for the "versatility" or "promiscuity" of protein domains; these methods take different approaches to finding and ranking domains that are present and abundant in a large variety of proteins. Here an attempt has been made to summarize and compare these methods. The methods were applied to the latest version of the Pfam database and compared using the Spearman and Jaccard distance metrics. By testing for GO-term enrichment with the hypergeometric test, an attempt has been made to identify more objectively the similarities and differences between the methods in terms of biological functions and processes. The results show that the methods have a weak but significant similarity to each other. The GO terms enriched among the versatile domains show a bias towards central regulatory and metabolic/catabolic pathways and key enzymes in all kingdoms. One deviation from the enrichment results is the DVI method in Eukaryota, which shows a bias towards membrane processes among the enriched terms. In conclusion, it should be possible to modify one of the existing methods, or create a new one, to find versatile domains that show a more consistent association with certain biological functions or processes.
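The hypergeometric enrichment test mentioned above asks: if K of the N domains in the background carry a given GO term, how surprising is it to see k or more carriers among the n top-ranked "versatile" domains? A minimal stdlib sketch, with invented counts (not from the study):

```python
# Sketch: one-sided hypergeometric p-value for GO-term enrichment.
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when drawing n domains without replacement from a
    background of N domains, K of which carry the GO term."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

# Hypothetical numbers: 12 of 50 versatile domains carry a term that
# annotates 100 of 5000 background domains (expected count: 1).
p = hypergeom_pvalue(12, 50, 100, 5000)
print(p)
```

In practice each tested GO term gets such a p-value, followed by a multiple-testing correction across terms.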
Tuesday December 18 10:30 Oliver Serang
Lunch room at SciLifeLab floor 2 A generic graphical method for using empirical null distributions in the robust evaluation of discoveries
Paired empirical null data are ubiquitous: they can come from samples from a patient who doesn't have a disease, from mice injected with saline, or from *omics data generated from tubes filled with water. Here a novel graphical nonparametric Bayesian method is presented, which harnesses heterogeneous null data and uses them to evaluate discoveries or even to choose hyperparameters commonly thought to be inestimable (e.g., the FDR threshold). In particular, the method is demonstrated on mass spectrometry-based proteomics data, for which little ground truth is available.

Previous seminars at SBC: 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, and 2011.
(Link to the rolling schedule for internal SBC speakers and preliminary dates for undergraduate thesis presentations)

Other seminar series

SBC Journal Club on Protein Function
SBC Journal Club on Sequence/Structure

Biochemistry Department Seminar Series
Karolinska Seminar Series

Dave Messina