Article17 November 2015Open Access An experimentally supported model of the Bacillus subtilis global transcriptional regulatory network Mario L Arrieta-Ortiz Mario L Arrieta-Ortiz Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Christoph Hafemeister Christoph Hafemeister Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Ashley Rose Bate Ashley Rose Bate Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Timothy Chu Timothy Chu Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Alex Greenfield Alex Greenfield Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Bentley Shuster Bentley Shuster Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Samantha N Barry Samantha N Barry Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Matthew Gallitto Matthew Gallitto Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Brian Liu Brian Liu Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Thadeous Kacmarczyk Thadeous Kacmarczyk Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Francis Santoriello Francis Santoriello Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Jie Chen Jie Chen Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Christopher DA Rodrigues Christopher DA Rodrigues Department of Microbiology and Immunobiology, Harvard Medical School, Boston, MA, USA Search for more papers by this author Tsutomu Sato Tsutomu Sato Department of Frontier Bioscience, Hosei University, Koganei, Tokyo, Japan Search for more papers by this author David Z Rudner David Z Rudner Department of Microbiology and Immunobiology, Harvard Medical School, Boston, MA, USA Search for more papers by this author Adam Driks Adam Driks Department of Microbiology and Immunology, Stritch School of Medicine, Loyola University Chicago, Maywood, IL, USA Search for more papers by this author Richard Bonneau Corresponding Author Richard Bonneau Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Courant Institute of Mathematical Science, Computer Science Department, New York, NY, USA Simons Foundation, Simons Center for Data Analysis, New York, NY, USA Search for more papers by this author Patrick Eichenberger Corresponding Author Patrick Eichenberger Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Mario L Arrieta-Ortiz Mario L Arrieta-Ortiz Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Christoph Hafemeister Christoph Hafemeister Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Ashley Rose Bate Ashley Rose Bate Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Timothy Chu Timothy Chu Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Alex Greenfield Alex Greenfield Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Bentley Shuster Bentley Shuster Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Samantha N Barry Samantha N Barry Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Matthew Gallitto Matthew Gallitto Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Brian Liu Brian Liu Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Thadeous Kacmarczyk Thadeous Kacmarczyk Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Francis Santoriello Francis Santoriello Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Jie Chen Jie Chen Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Christopher DA Rodrigues Christopher DA Rodrigues Department of Microbiology and Immunobiology, Harvard Medical School, Boston, MA, USA Search for more papers by this author Tsutomu Sato Tsutomu Sato Department of Frontier Bioscience, Hosei University, Koganei, Tokyo, Japan Search for more papers by this author David Z Rudner David Z Rudner Department of Microbiology and Immunobiology, Harvard Medical School, Boston, MA, USA Search for more papers by this author Adam Driks Adam Driks Department of Microbiology and Immunology, Stritch School of Medicine, Loyola University Chicago, Maywood, IL, USA Search for more papers by this author Richard Bonneau Corresponding Author Richard Bonneau Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Courant Institute of Mathematical Science, Computer Science Department, New York, NY, USA Simons Foundation, Simons Center for Data Analysis, New York, NY, USA Search for more papers by this author Patrick Eichenberger Corresponding Author Patrick Eichenberger Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA Search for more papers by this author Author Information Mario L Arrieta-Ortiz1,‡, Christoph Hafemeister1,‡, Ashley Rose Bate1, Timothy Chu1, Alex Greenfield1, Bentley Shuster1, Samantha N Barry1, Matthew Gallitto1, Brian Liu1, Thadeous Kacmarczyk1, Francis Santoriello1, Jie Chen1, Christopher DA Rodrigues2, Tsutomu Sato3, David Z Rudner2, Adam Driks4, Richard Bonneau 1,5,6 and Patrick Eichenberger 1 1Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA 2Department of Microbiology and Immunobiology, Harvard Medical School, Boston, MA, USA 3Department of Frontier Bioscience, Hosei University, Koganei, Tokyo, Japan 4Department of Microbiology and Immunology, Stritch School of Medicine, Loyola University Chicago, Maywood, IL, USA 5Courant Institute of Mathematical Science, Computer Science Department, New York, NY, USA 6Simons Foundation, Simons Center for Data Analysis, New York, NY, USA ‡These authors contributed equally to this work *Corresponding author. Tel: +1 212 992 9516; E-mail: [email protected] *Corresponding author. Tel: +1 212 998 8247; E-mail: [email protected] Molecular Systems Biology (2015)11:839https://doi.org/10.15252/msb.20156236 PDFDownload PDF of article text and main figures. Peer ReviewDownload a summary of the editorial decision process including editorial decision letters, reviewer comments and author responses to feedback. ToolsAdd to favoritesDownload CitationsTrack CitationsPermissions ShareFacebookTwitterLinked InMendeleyWechatReddit Figures & Info Abstract Organisms from all domains of life use gene regulation networks to control cell growth, identity, function, and responses to environmental challenges. Although accurate global regulatory models would provide critical evolutionary and functional insights, they remain incomplete, even for the best studied organisms. Efforts to build comprehensive networks are confounded by challenges including network scale, degree of connectivity, complexity of organism–environment interactions, and difficulty of estimating the activity of regulatory factors. Taking advantage of the large number of known regulatory interactions in Bacillus subtilis and two transcriptomics datasets (including one with 38 separate experiments collected specifically for this study), we use a new combination of network component analysis and model selection to simultaneously estimate transcription factor activities and learn a substantially expanded transcriptional regulatory network for this bacterium. In total, we predict 2,258 novel regulatory interactions and recall 74% of the previously known interactions. We obtained experimental support for 391 (out of 635 evaluated) novel regulatory edges (62% accuracy), thus significantly increasing our understanding of various cell processes, such as spore formation. Synopsis A new computational framework integrating network component analysis and model selection is combined with transcriptomic datasets and generates an expanded and more accurate transcriptional regulatory network (TRN) for Bacillus subtilis. A global TRN is inferred for B. subtilis and contains 3,086 protein-coding genes, 215 transcription factors (TFs) and predicts 4,516 interactions (2,258 novel). Previously known interactions are recalled at high proportion (74%) and experimental support is provided for 1,289 TF–gene interactions (out of 1,841 tested) in transcriptional profiling data with KO strains, including 391 (out of 635) novel interactions. The inferred TRN provides novel functional insights even for well-studied pathways, such as spore formation. Introduction As cells navigate their environment, divide and differentiate, they rely on gene regulation networks. Building accurate and comprehensive models of the interactions between regulators and their target genes is essential to our understanding of basic biology and evolution in all living systems (Marbach et al, 2012). Analysis of gene regulatory networks can reveal how diverse cell processes are balanced in a single organism and facilitate genome annotation by uncovering hidden functions of co-regulated genes. Global gene regulation networks also provide information on a system's robustness and evolvability, thus influencing bioengineering strategies. In spite of the importance of having complete models, the fraction of known regulatory interactions is quite small for most species, with well-studied organisms having at best half of their genes paired with a regulator (Salgado et al, 2013). The lack of any completely characterized network stems from the fact that experimental assays are limited to measuring the consequences of gene regulation (e.g. changes in RNA or protein levels) or assessing the binding of regulators to promoters or mRNAs (Hughes & de Boer, 2013). The rate at which new regulatory interactions are identified was, however, significantly accelerated with the advent of genomic technologies (Salgado et al, 2013). Here, we focus on Bacillus subtilis, a model organism for the human pathogens Bacillus anthracis, Clostridium difficile, Listeria monocytogenes, and Staphylococcus aureus. B. subtilis was the first Gram-positive bacterium to have its genome sequenced (Kunst et al, 1997; Barbe et al, 2009) and is a major model system for competence, biofilm, and spore formation (Dubnau & Mirouze, 2013; Vlamakis et al, 2013; Cairns et al, 2014; Tan & Ramamurthi, 2014). Sporulation, in particular, is among the best-understood developmental processes in biology. The main regulators of gene expression during sporulation are σ factors, subunits of the RNA polymerase conferring DNA binding specificity to the holoenzyme (Stragier & Losick, 1990; Feklístov et al, 2014). Asymmetric division of the sporulating cell results in two cell types, a forespore that will mature into a spore and a larger mother cell. Forespore maturation depends on the mother cell for metabolic activity and synthesis of more than 70 spore coat proteins (McKenney et al, 2013). Many of the original transcriptomic studies in B. subtilis focused on gene expression during sporulation and other stress responses (Fawcett et al, 2000; Cao et al, 2002; Price et al, 2002; De Hoon et al, 2010). More recently, an expanded transcriptomic dataset was collected (Nicolas et al, 2012). An overarching goal in B. subtilis systems biology is to integrate these datasets with quantitative proteomics (Soufi et al, 2010) and analyses of metabolic fluxes (Chubukov et al, 2013) to obtain a comprehensive model of gene regulation, similar to what has been accomplished in a multi-omics investigation of the glucose to malate metabolic shift (Buescher et al, 2012). Previous regulatory network inference efforts can be divided into three main categories: (i) curation of literature and transcription factor (TF) binding sites (Freyre-González et al, 2013; Leyn et al, 2013); (ii) genetic perturbations to learn directed edges (Madar et al, 2010; Ciofani et al, 2012); and (iii) modeling regulation as a dynamic process using time series data (Bonneau et al, 2006). To reduce the complexity of the problem, regulatory networks have often been determined for TFs controlling gene clusters, instead of individual genes (Fadda et al, 2009; Lemmens et al, 2009; Waltman et al, 2010; Brooks et al, 2014; Peterson et al, 2015; Reiss et al, 2015), and metabolic pathways were integrated with regulatory networks (Oh et al, 2007; Goelzer et al, 2008; Labhsetwar et al, 2013; O'Brien et al, 2013; Carrera et al, 2014). As a whole, prior network inference studies improved prediction of the effects of genetic perturbations or the accumulation rates of metabolites under different growth conditions (Imam et al, 2015; Kim et al, 2015). In spite of early successes (Faith et al, 2007), most studies remained limited in a number of ways and often relied on heterogeneous datasets (e.g. using various microarray platforms and strain backgrounds). For instance, predicted networks for E. coli and B. subtilis were less complex than the prior known networks (i.e. these studies did not expand the known networks substantially, but instead highlighted a small focused set of new and known edges). Furthermore, in most cases, the accuracy of novel predictions was not systematically assessed in follow-up experiments. Network inference is a difficult problem because of (i) biological complexity (the activity of a transcription factor (TF) is not linearly related to its abundance); (ii) non-identifiability (biological networks are robust and thus many potential models will explain any given dataset equally well); and (iii) systematic error. Although complexity and measurement error constitute the two most often cited challenges, non-identifiability is perhaps a greater problem (Marbach et al, 2012). To address this issue, we described methods for learning regulatory networks that use prior knowledge on network structure to improve accurate identification of large networks (BBSR: Bayesian Best Subset Regression; Greenfield et al, 2013). Here, we use known interactions gathered from SubtiWiki (Michna et al, 2014) to both estimate TF activities (TFA) by employing network component analysis (NCA) (Liao et al, 2003) and constrain the model selection step of our method (BBSR-TFA). Estimated TFA and NCA have been used in previous network inference efforts. For example, chromatin immuno-precipitation (ChIP) binding data were used to estimate S. cerevisae TFA followed by correlation for target identification (Gao et al, 2004). Similarly, interactions from RegulonDB (Salgado et al, 2013) were used to determine dynamics of activities of E. coli TFs during carbon source transition (Kao et al, 2004). Following these initial applications, numerous methods to estimate TFA and to learn the regulatory network have been proposed (Boulesteix & Strimmer, 2005; Galbraith et al, 2006; Sanguinetti et al, 2006; Gu et al, 2007; Fu et al, 2011; Noor et al, 2014). These methods have in common that they model gene expression to be the result of the connectivity strength between TF–gene pairs and TF activity, where the activity is a latent variable pooling the effects of post-transcriptional and post-translational modifications. More recent applications include the identification of key TFs and their targets in mice during rapamycin treatment (Tran et al, 2010), and a regulatory network important in floral development in A. thaliana (Misra & Sriram, 2013). To our knowledge, there is only one previous application of NCA to B. subtilis data (Buescher et al, 2012). In that work, known transcriptional regulation was taken from literature, the DBTBS (Sierro et al, 2008) and SubtiWiki databases, and CcpA ChIP-chip data, to learn the regulatory network perturbed during change of carbon substrate from glucose to malate and vice versa. However, as the main focus was on metabolism, none of the 1,488 predicted interactions were assessed in follow-up experiments. In this work, we apply a unified new combination of NCA and model selection to an experimental design expressly conceived to dynamically probe the principal cellular pathways of B. subtilis, and we identify 2,258 novel regulatory interactions of unprecedented accuracy. Results and Discussion A compendium of B. subtilis transcriptional profiling data Our goal is to infer the transcriptional regulatory network (TRN) from two large transcriptomic datasets, while also incorporating previously validated TF–target gene interactions (Fig 1). These known regulatory interactions, compiled in SubtiWiki (Michna et al, 2014), represent a large body of prior work using several experimental methods to characterize functional regulatory links and are thus a powerful complement to the transcription compendium described below. We collected a global gene transcription compendium for strain PY79, a derivative of strain 168 frequently used for transcriptional profiling (Fawcett et al, 2000; Eichenberger et al, 2004; Wang et al, 2006), and obtained transcriptional profiles for 4,002 protein-coding genes from a total of 403 samples in 38 separate experimental designs (Table EV1 for strains used, Table EV2 for experimental conditions, GEO accession number GSE67023). A large fraction of the data was collected as time series, which improves our ability to infer directed edges (Bonneau et al, 2006). We investigated an entire life cycle from spore germination to sporulation (with samples collected at 30-min intervals), as well as stress responses, competence, and biofilm formation (complete results in Dataset EV1). We added a previously published dataset with 269 samples covering 104 conditions, using strain BSB1, another derivative of strain 168 (Nicolas et al, 2012). In order to incorporate informative priors on network structure, we retrieved from SubtiWiki a list of 3,040 experimentally validated regulatory interactions (Dataset EV2). Subsequently, we refer to this set of interactions as the "gold standard" (GS) or "prior network". Figure 1. General workflow for inferring the B. subtilis transcription networkTwo transcriptomic data compendia were used, one collected specifically for this study (strain PY79) and one previously published for strain BSB1 (Nicolas et al, 2012). Transcription factor activities (TFA) were estimated independently for each dataset using interactions in the gold standard (GS) extracted primarily from SubtiWiki (step 1). Datasets, estimated TFA, and priors on network structure (from the GS) were used as inputs for prediction of regulatory interactions (step 2). Next, output networks (one for each strain/dataset) were merged into a combined network (inferred TRN) (step 3) and prediction accuracy was evaluated in follow-up experiments. Download figure Download PowerPoint Estimating transcription factor activities (TFA) increases the accuracy of network inference To learn the B. subtilis TRN, we used a new combination of our Inferelator-BBSR approach (Greenfield et al, 2013), with a method for estimating transcription factor activities (TFA). Previously, gene-specific regulators were discovered based on mRNA abundance correlation, that is, gene transcription profiles were modeled as linear combinations of one or a few TF transcription profiles. With our new dataset, we observed that transcription profile is not an optimal proxy for a TF's regulatory strength (Fig 2). As a consequence, we modified the procedure by introducing an initial step where TFA are estimated based on known regulatory interactions for each experimental condition. To do so, we used NCA (Liao et al, 2003) with a simplified model of transcriptional regulation compared to previous work (Kao et al, 2004; Buescher et al, 2012) (details given in method section). Conceptually, this is similar to estimating TFA with a reporter gene, although here every known target of a TF is used as reporter and we explicitly model activation, repression, and genes under multi-TF control. These TFA were then used as predictors to learn the strength and sign of TF–gene interactions. Subsequently, predictions for each dataset were integrated into a combined network, where potential interactions were ranked by their confidence scores to provide networks that meet specific accuracy requirements (calibrated using known interactions). Figure 2. Incorporating Transcription Factor Activities (TFA) in the network inference procedure Partial Pearson correlation between mRNA transcription levels (PY79 dataset) was computed for each TF–target gene pair in the GS (top histogram). Partial Pearson correlation was also computed between the estimated activity of a TF and the transcription of its targets (bottom histogram). The advantage of estimating TFA is illustrated for three regulators. Each point corresponds to the results of one microarray experiment, and TFA are estimated for each experimental condition. Top panel: A nonlinear correlation is observed between comK transcription and transcription of ComK targets, whereas a strong linear correlation is obtained between ComK activity and transcription of ComK targets. Middle panel: No correlation is observed between codY transcription and transcription of CodY targets. CodY activity is modulated by GTP and branched chain amino acids (BCAA). A negative correlation is observed between estimated CodY activity and transcription of CodY targets. Bottom panel: Spo0A activity is modulated by phosphorylation. A better correlation is observed between Spo0A activity and transcription of Spo0A targets than between spo0A transcription and transcription of Spo0A targets. Download figure Download PowerPoint Motivation for estimating TFA In Fig 2A (top panel), we plot the partial correlation between transcription of each TF and known target gene along all conditions in the PY79 compendium. We observe that many TF–target pairs are only moderately correlated. This is expected, as a gene may be controlled by multiple regulators, while interactions of a TF with its targets are typically restricted to a subset of experimental conditions where the TF is expressed and active. We also note a high proportion of known negative interactions (repression) with positive correlation scores. This can happen when repressors work as components of negative feedback mechanisms. For instance, when a repressor takes part in an incoherent feed forward loop (FFL), expression of the repressor will start at the same time as its targets and the negative effect on target expression will only be sensed after a delay necessary for the accumulation of the repressor (Mangan & Alon, 2003; Alon, 2007). By contrast, correlations between estimated TFA and target gene transcription show fewer interactions with low correlation and better separation between activating and repressing interactions (Fig 2A bottom panel). We examined the relationships between TF activity and target gene transcription in a group of 50 TFs with at least ten experimentally validated targets (Dataset EV3). Nonlinear relationships were detected for almost all regulators, including the master regulator of competence ComK (Fig 2B top panel), which is subjected to a positive auto-regulatory loop and binds to its target promoters as dimer (Hamoen et al, 1998; Hamoen, 2003). The absence of a linear relationship between transcription of TFs and their targets was frequently noted for regulators that require co-factors, such as CodY (Sonenshein, 2007), a repressor whose activity is modulated by GTP and branched chain amino acids (Fig 2B middle panel). By contrast, for both ComK and CodY, a linear relationship is observed between TF activity and transcription of their targets. This linear relationship is a consequence of the way TFA are estimated (see 3). Considering that the Inferelator is based on a linear model (see 3), this linearization step is likely to improve the detection of additional regulatory interactions. This improvement would affect primarily TFs whose activity can be accurately estimated [i.e. those with > 10 known target genes, see below and Appendix Fig S1 (for the BSB1 data compendium) and Appendix Fig S2 (for the PY79 data compendium)]. Another major reason for discrepancies between TF transcription and target gene transcription is caused by post-translational modifications, such as the phosphorylation of response regulators in two-component systems (Salazar & Laub, 2015). A classic example is Spo0A, the master regulator of sporulation (Molle et al, 2003), which is activated by a phosphorelay (Fig 2B bottom panel). As Spo0A is both an activator and a repressor of transcription, we note that an efficient way to discriminate activated from repressed targets is to encode the sign of the interactions in the prior. Overall, estimation of TFA greatly increases the predictive power of the Inferelator. Performance of our network prediction framework Given TFA estimates, we must still select the most probable regulatory network model from the transcription data. This very large-scale model selection step also leverages the dynamical information built into our experimental design. We use TFA as predictors (and rely on transcription profiles for TFs without known targets) to infer the global TRN with our framework for regulatory network model selection (Inferelator-BBSR). We analyzed the performance of our approach by (i) assessing the recovery of known regulatory edges; (ii) assessing the robustness to noise; (iii) evaluating the experimental support for the predictions; and (iv) comparing to other network inference methods. To assess the recovery of known interactions, we compute the Area Under the Precision Recall (AUPR) curve (Fig 3A) for this recovery task under both settings (recovery of known and learning of new interactions); AUPR has a value of 1 when all GS interactions rank top of the list and close to 0 for random predictions. In addition to the Inferelator, we also evaluated CLR (Faith et al, 2007) and Genie3 (Huynh-Thu et al, 2010), two state-of-the-art network inference methods. In this analysis, GS interactions were only used for TFA estimation and we did not incorporate prior information during the model selection step of the Inferelator. A random set of 50% of the GS interactions were used for TFA estimation, while the remaining GS interactions were used to calculate precision and recall. We note that (i) a higher score is obtained for the combined network than for networks derived from each dataset independently; (ii) scores for the predicted networks are significantly higher when TFA is used; and (iii) although priors were not used to influence model selection, the Inferelator has the highest AUPR among the compared methods. Figure 3. Performance of network inference methods when incorporating TFA Precision–Recall plot of the confidence-ranked interaction networks using CLR, Genie3, and the Inferelator (no priors). Solid lines show performance using TFA. Dashed lines show performance when no TFA are used (when raw expression values for TFs are used as predictors). The numbers superimposed on each curve indicate the area under the curve. Performance of BBSR-TFA (AUPR: area under precision recall curve) on the combined, BSB1 and PY79, networks in the presence of false prior information. 50% of the edges in the GS are used as true priors, and various amounts of random edges are added. Performance is evaluated on the leave-out set of interactions. Each point represents the median of five random samples of 50% of the GS set. Support from KO data for the models predicted by BBSR-TFA, Genie3 (G3), CLR, and a consensus method (META) that rank combines the prediction of the three methods. Methods were used without and with TFA (TFA tag). The number on top of each bar indicates the proportion of evaluated interactions with KO support for the corresponding method. Left and right panels show the support for each method when all interactions (recovered priors and novel interactions) and only novel interactions are considered, respectively. Download figure Download PowerPoint To determine the stability of the estimated TFA, we examined the