Combining Genome-Scale Experimental and Computational Methods To Identify Essential Genes in Rhodobacter sphaeroides

Knowledge about the role of genes under a particular growth condition is required for a holistic understanding of a bacterial cell and has implications for health, agriculture, and biotechnology. We developed the Tn-seq analysis software (TSAS) package to provide a flexible and statistically rigorous workflow for the high-throughput analysis of insertion mutant libraries, advanced the knowledge of gene essentiality in R. sphaeroides, and illustrated how Tn-seq data can be used to more accurately identify genes that play important roles in metabolism and other processes that are essential for cellular survival.

In recent years, genome-enabled experimental and computational approaches have been developed to link genes to cellular function or phenotype. Transposon sequencing (Tn-seq) is one such experimental approach, combining transposon mutagenesis with next-generation sequencing (NGS) to permit genome-wide identification of insertion sites in a library of mutants (10)(11)(12)(13). Monitoring either the frequency of transposon insertions in a library under a single condition or changes in mutant abundance between conditions allows one to identify essential genes and the fitness contributions of nonessential genes on a genome-wide scale.
Genome-scale metabolic network reconstructions are computational frameworks, wherein genes and gene products are assigned to reactions in one or more metabolic pathways (14). The resulting genome-scale metabolic network model (GEM) can be predictive of growth, metabolite and end product accumulation, and gene essentiality. However, the predictive power of these, and other genome-scale approaches, is often limited by lack of knowledge about proteins that catalyze reactions in a pathway or the role of uncharacterized gene products under one or more conditions. Recent efforts have utilized essential gene analysis to improve the accuracy of GEMs in both eukaryotic and cyanobacterial model systems (15,16).
In this study, we used Tn-seq to identify essential genes in R. sphaeroides under several growth conditions, thus establishing relationships between genes and phenotypes for bacterial genes of known and unknown function. To aid this study, we developed software that provides a streamlined and statistically rigorous workflow to identify the contributions of genes in a Tn-seq mutant library. These Tn-seq data were then used to evaluate and refine an existing genome-scale metabolic network reconstruction, with improvements to the resulting GEM. By comparing Tn-seq data to predictions of the R. sphaeroides metabolic model iRsp1140 (17,18), we confirmed roles for many genes and assigned potential functions to previously uncharacterized genes. Our findings provide new insight into the lifestyles of this bacterium and illustrate how a systems-level understanding of cells can be improved by combining genome-scale experimental and computational approaches.

RESULTS
Workflow for Tn-seq data analysis. By leveraging the random insertion of transposons into a genome, Tn-seq can identify essential genes and the fitness contributions of nonessential genes on a genomic scale (10)(11)(12)(13). This assumes that transposon insertions are often detrimental to gene function and therefore that loci that are important for growth under a given condition are less likely to harbor these genetic lesions. By analyzing the number of transposon insertions, or the frequency of DNA sequence reads mapping to these insertions, one can determine at some confidence level the likelihood that a given locus is important for growth. A number of software packages have been developed for analysis of Tn-seq data, with several of these packages designed to analyze Tn-seq data derived from mariner-based transposons, taking advantage of their thymine-adenine (TA) insertion site requirement (19)(20)(21). To improve the ability of researchers to analyze a variety of Tn-seq libraries, we developed a statistically rigorous, flexible, and streamlined software package that generates lists of important genes using NGS data of transposon libraries. Our Tn-seq analysis software (TSAS) utilizes the Java Statistical Classes application programming interface to enable rapid analysis of aligned DNA sequence reads to test the likelihood of a gene's importance under a given condition. The TSAS workflow (Fig. 1) is organism agnostic and requires only (i) genome sequence (in FASTA format), (ii) genome coordinates (in GFF v3 format), and (iii) aligned DNA sequence reads (in Bowtie, SOAP, or Eland format).
In the first phase of the TSAS workflow, the aligned reads are used to identify and quantify transposon insertion sites across the genome. Genome coordinates are then used to determine gene-level insertion and read counts. To facilitate data visualization, the insertion location and abundance data are summarized in wiggle (WIG) format files, which can be viewed using publicly available genome browsers. Once the data for each gene have been summarized, TSAS can assess condition-dependent gene essentiality based either on a theoretical distribution of the insertions in a given sample (onesample analysis) or on the use of a reference data set (two-sample analysis), as outlined below.
One-sample analysis. When using the one-sample analysis option in TSAS, the total number of insertions per gene is compared to a binomial distribution, which assumes that each base pair in the genome has an equal likelihood of a transposon insertion event. The use of a binomial distribution enables one to assess the probability of having a specific number of insertions within a locus of specified length (22). TSAS uses this probability as a proxy for the likelihood that disruption of a gene has a nonneutral impact on growth under a given condition ( Fig. 1; see Text S1 [TSAS User Guide] in the supplemental material). Using a binomial distribution mitigates potential overestimation of the importance of small genes, which may have low numbers of insertions merely due to their size (see Materials and Methods). Furthermore, by using the number of unique insertions, instead of the total number of sequence reads per gene, TSAS bypasses artifacts associated with preparation of NGS libraries that can artificially inflate the frequency of DNA sequence reads at some insertion sites (23)(24)(25). The output for a one-sample TSAS analysis provides separate adjusted P values to assess genes having fewer insertions than expected (i.e., transposon insertions result in decreased relative fitness) or more insertions than expected (i.e., transposon insertion results in improved relative fitness) (see Text S1 [TSAS User Guide]).
Two-sample analysis. When DNA sequence data from a library grown under experimental and reference conditions are available, the user can run a two-sample TSAS analysis. In this case, the number of mapped sequence reads per gene (normalized, and with an optional bias correction) for each condition is used to calculate fold changes and associated P values (Fig. 1 Mutant library construction and validation. We generated an R. sphaeroides Tn5 mutant library via conjugation with a diaminopimelic acid (DAP)-auxotrophic, pRL27containing Escherichia coli donor strain (WM6439). Tn5 was chosen for this work based on the high guanine-cytosine (GC) content of R. sphaeroides (69% GC), which precluded the use of TA dinucleotide sequence-specific mariner-based transposons (26). Use of the WM6439 donor strain allowed for selection of kanamycin-resistant transconjugants on a rich medium (Luria-Bertani [LB]), thereby increasing the potential diversity of Tn5 insertion sites in the master library. Colonies from master library plates were pooled, aliquoted, and stored at Ϫ80°C at~1.2 ϫ 10 10 CFU/ml. As a quality assessment, we recovered transposon/genome junctions from two aliquots of this library using the Tn-seq circle method (27) and analyzed the data in a one-sample analysis with TSAS, finding~200,000 unique insertion sites in each library aliquot ( Table 1). The number of mapped DNA sequence reads per gene and insertions per gene were highly correlated between aliquots, demonstrating that we reproducibly recover transposon/genome junctions by this method (Fig. S1).
Identifying genes required for growth on LB. We performed a one-sample analysis to identify genes essential for aerobic growth on LB. We identified 839 genes with fewer insertions than expected, based on the size of the genes and the number of insertions in the library (FWER Ͻ 0.01). Of these, 493 contained fewer than four unique insertions in the central 80% of the gene (see Materials and Methods) and were classified as essential for aerobic growth on LB (Table S1). As an example, Fig. 2 shows the insertion and read count data for the genomic region surrounding several essential genes.
We used the Clusters of Orthologous Groups of Proteins (COG) categories (28) to get an overview of the R. sphaeroides essential genes (Fig. 3). Nine COG categories are overrepresented in the list of essential genes (P value Ͻ 0.05), including many fundamental biological processes. The COG category corresponding to "Translation, Ribosomal Structure, and Biogenesis" had the greatest representation, with 111 of 271 (41%) genes present in this list.
We queried the Database of Essential Genes (DEG) (29) with the 489 protein-coding essential genes, using BLASTP to identify homologs in the bacterial database. Of these 489 genes, 431 had a putative homolog in DEG (expect threshold Ͻ 1 ϫ 10 Ϫ10 ; score Ͼ 500 [ Table S2]). Thus, 88% of the genes essential for aerobic growth of R. sphaeroides on LB agree with gene essentiality predictions for one or more of the 39 bacteria in DEG. The 58 R. sphaeroides genes without a homolog in DEG, including 12 that encode hypothetical proteins, represent candidates for previously uncharacterized genes that are essential for growth on LB. For instance, in the analysis of these 58 genes, four Gene Ontology (GO) categories related to vitamin B 12 (cobalamin) biosynthesis or transport (GO:0009236, cobalamin biosynthetic process; GO:0015420, cobalamin-transporting ATPase activity; GO:0015889, cobalamin transport; and GO:0035461, vitamin transmembrane transport) are enriched (P value Ͻ 0.0001), suggesting that the biosynthesis, transport, or need for cobalamin is different in R. sphaeroides than in the other bacteria in the DEG database, at least under the conditions tested.   The properties of existing mutants can also be used as an independent test of the data derived from TSAS analysis of the transposon mutant library. For instance, RSP_1519 (prrC), a gene encoding a copper chaperone involved in capture and delivery of copper to cytochrome c oxidases (30) and also shown to have thiol-disulfide oxidoreductase activity (31), was found among the list of 489 essential genes. Deletion of prrC leads to increased colony pigmentation in the presence of O 2 , but loss of this gene is not reported to have a significant effect on aerobic growth on a defined medium (32). Thus, the inability to recover prrC mutants in the master library suggests that there is a previously unreported defect in aerobic growth on LB. The designation of prrC as an essential gene may also illustrate a caveat of Tn-seq, whereby slowgrowing strains can erroneously be classified as essential. For example, the 14 subunits of NADH dehydrogenase (nuoA to -N) are identified by Tn-seq as essential when R. sphaeroides cells are grown aerobically on LB. While the presence of several of these subunits in the DEG database lends credence to their classification as essential in R. sphaeroides, strains lacking RSP_2521 (nuoG) have been observed to grow approximately twice as slow as wild-type cells under aerobic conditions in several media (33).
Identifying genes required for growth on a minimal medium. The utility of TSAS was also evaluated with a two-sample analysis to identify conditionally essential genes for aerobic growth in Sistrom's minimal medium (SMM), which lacks exogenous amino acids and contains succinate and ammonium as the sole carbon and nitrogen sources. In this case, we grew the master library aerobically in SMM for approximately seven doublings by optical density at 600 nm (OD 600 ) before preparing genomic DNA for sequencing. We identified 105 genes with fewer reads after aerobic growth in SMM compared to growth on LB (log fold change Ͻ Ϫ2; adjusted P value [proportions_insertions] Ͻ 0.05). Of these, 85 contained fewer than four unique insertions in the central 80% of the gene and were classified as conditionally essential for aerobic growth in SMM (Table S3; examples in Fig. 4). Since SMM lacks amino acids, we expected that many of the 85 genes identified as conditionally essential in this experiment would encode biosynthetic enzymes. Indeed, the GO terms "cellular amino acid biosynthetic process" (GO:0008652; 30 of 58 genes; P value ϭ 8.02 ϫ 10 Ϫ41 ), "purine nucleotide biosynthetic process" (GO:0006164; 10 of 14 genes; P value ϭ 2.7 ϫ 10 Ϫ17 ), and "pyrimidine nucleotide biosynthetic process" (GO:0006221; 6 of 10 genes; P value ϭ 9.4 ϫ 10 Ϫ11 ) are overrepresented in 46 of these 85 genes. In addition, most of the other 39 genes that are identified as essential for aerobic growth in SMM are annotated as encoding biosynthetic functions.

Tn-seq in Rhodobacter sphaeroides
Included in the remaining genes identified as conditionally essential for aerobic growth in SMM are four that encode hypothetical proteins. When we generated a markerless, in-frame deletion mutant for one of these genes (RSP_1566), we found that this strain was able to grow aerobically on LB plates but was unable to grow aerobically on SMM plates (Fig. 5), demonstrating that the Tn-seq analysis accurately identified conditional essentiality for this gene. This gene product is predicted to contain a CbiX/SirB chelatase domain that often functions in siroheme or vitamin B 12 synthesis (34), suggesting that RSP_1566 plays a previously unknown role in inserting metal into sirohydrochlorin under aerobic conditions. Another hypothetical protein identified as essential for aerobic growth in SMM, RSP_6139, is likely to be involved in amino acid biosynthesis since it is potentially cotranscribed with genes encoding the two subunits of the isopropylmalate isomerase (leuD and leuC) in the leucine biosynthetic pathway. Two other genes identified as essential for aerobic growth in SMM, RSP_1147 and RSP_1148, are potentially cotranscribed with neighboring genes encoding the large and small subunits of glutamate synthase (gltB and gltD, respectively), suggesting that these hypothetical proteins also play a previously unrealized role in amino acid biosynthesis.
Identifying genes required for photosynthetic growth. To identify genes required for photosynthetic growth, we grew the library photosynthetically for seven doublings by OD 600 in SMM before determining which genes had undergone selection. In a two-sample analysis (using aerobic growth on LB as a control), we identified 146 genes with fewer reads after photosynthetic growth in SMM (log fold change Ͻ Ϫ2; adjusted P value [proportions_insertions] Ͻ 0.05). Of these, 91 had fewer than four insertions in the central 80% of the gene and are considered conditionally essential for photosynthetic growth in SMM. In order to remove the effect of growth in a minimal medium, we excluded genes essential for aerobic growth in SMM (see above), leaving 31 genes required for photosynthetic growth (Table S4). Of these 31 genes, 14 are found in the so-called photosynthetic gene cluster (PGC), a group of genes known to encode components of the photosynthetic apparatus in this and related purple nonsulfur bacteria (35) (Fig. 6). The transcriptional regulator CbbR, responsible for regulating genes of the reductive pentose phosphate pathway (Calvin cycle) (36)(37)(38), was also classified as conditionally essential for photosynthetic growth. Several genes not previously identified as having a role in photosynthesis, such as RSP_2885 (glgA; glycogen synthase), RSP_0465 (putative protease), and RSP_2947 (glutamate racemase), were identified as conditionally essential in this analysis. In addition, four genes encoding hypothetical proteins were also classified as essential for photosynthetic growth (Table S4). Future experiments are needed to test the role and function of these genes in photosynthetic growth. Using Tn-seq data to refine a metabolic model. iRsp1140 is a GEM for R. sphaeroides 2.4.1 consisting of 878 metabolites, 1,416 reactions, and 1,137 genes, covering about 25% of the predicted open reading frames (18,39). Essential genes can be predicted with iRsp1140 by systematic in silico deletion of individual genes followed by assessment of whether the model predicts growth (40,41). To compare essential genes identified with Tn-seq to those predicted by iRsp1140, we performed separate onesample analyses of the Tn5 library grown aerobically in SMM and photosynthetically in SMM.
A one-sample analysis of the library grown aerobically in SMM identified 501 essential genes (FWER Ͻ 0.01; Ͻ4 insertions in the central 80% of the gene), of which 289 are included in iRsp1140 (Table S5). In parallel, in silico simulations with iRsp1140 predicted 208 single-gene deletions that result in no aerobic growth in SMM. A comparison of the two data sets (Fig. 7A) shows that the TSAS and iRsp1140 analyses overlap in 132 essential genes (P value ϭ 2.9 ϫ 10 Ϫ40 , hypergeometric test). When we used a similar scheme to compare Tn-seq data and iRsp1140 essential gene predictions for photosynthetic growth in SMM (Fig. 7B), we again found significant but incomplete overlap (144 genes; P value ϭ 6.6 ϫ 10 Ϫ41 , hypergeometric test). Because the libraries were grown in SMM, there is considerable overlap between aerobic and photosynthetic analyses when considering TSAS-specific genes, iRsp1140-specific genes, and the intersect of essentiality predicted by TSAS and iRsp1140. For example, 64 of the 83 genes predicted to be essential for photosynthetic growth in SMM by iRsp1140 alone were also included in the list of iRsp1140-specific essential genes for aerobic growth in SMM.
The genes called essential either by TSAS or by iRsp1140 alone could provide insight into modifications that would improve iRsp1140's predictive ability. For instance, if a gene is identified as essential by TSAS but not iRsp1140, and the reaction catalyzed by the gene is known to be essential, then the set of isozymes assigned to the reaction in the metabolic model may need to be reduced. This was the case with RSP_2172 {metF, 5,10-methylenetetrahydrofolate reductase [NAD(P)]} and RSP_2769 (putative 5,10methylenetetrahydrofolate reductase), both associated in iRsp1140 with an essential reaction for tetrahydrofolate biosynthesis. Because only RSP_2172 was identified as essential by TSAS, the gene-protein-reaction (GPR) rule for this reaction was modified to include only RSP_2172, thus making the essentiality for this reaction agree between TSAS and iRsp1140. For other reactions catalyzed by genes called essential by TSAS that were not essential in iRsp1140, we investigated if whole pathways should be added to iRsp1140 or made essential through the addition of the products of these reactions to the biomass equation. This resulted in the addition of vitamin B 6 and vitamin B 12 to the biomass equation for growth in SMM, with whole pathways added for anaerobic synthesis of vitamins B 6 via 5-amino-1-(5-phospo-D-ribosyl)imidazole (42) and vitamin B 12 via hydroxypyruvate (43)(44)(45)(46).
Another approach to increase compatibility between TSAS and iRsp1140 predictions was to reevaluate reaction reversibility in selected pathways. For instance, iRsp1140 allowed the use of amino acid degradation pathways for both turnover and de novo synthesis of amino acids, thus resulting in predictions of nonessentiality for genes in their biosynthetic pathways. This was corrected using BioCyc (43,44) to guide modification of reaction reversibility, where necessary. In another example, iRsp1140 allowed for aerobic production of gluconeogenesis intermediates using the anaerobic Calvin cycle, resulting in many gluconeogenesis reactions being predicted as nonessential. To correct this, we modified model constraints, turning off known anaerobic growthspecific pathways by setting flux through these pathways to zero when growing aerobi-

FIG 7
Comparing TSAS-and iRsp1140-predicted essential genes. We used TSAS to identify genes essential for aerobic and photosynthetic growth in Sistrom's minimal medium and compared these to iRsp1140 essential gene predictions (aerobic [A] and photosynthetic [B]). We then updated iRsp1140 based on inspection of the genes that were predicted to be essential by either Tn-seq or TSAS alone, resulting in iRsp1140_opt. These changes to the model resulted in increased agreement with TSAS-predicted essential genes for aerobic (C) and photosynthetic (D) growth.
cally. Also, when growing photosynthetically (anaerobically in the light), iRsp1140 could use oxygen for synthesis of some essential biomass compounds. We again modified model constraints, turning off pathways for oxygen-dependent porphyrin synthesis and vitamin B 6 synthesis during anaerobic growth simulations. We also examined genes called essential by iRsp1140 but not by TSAS. In these cases, modifying GPR rules for essential reactions was difficult as there were often no known alternative enzymes capable of performing this function. We did, however, modify GPRs for several of the tRNA synthesis genes to match the essentiality calls from TSAS analysis. Another change based on the Tn-seq data was to remove the carotenoid spheroidenone from the aerobic biomass equation. This change alone supported removal of six of the iRsp1140-only genes, as the genes encoding enzymes responsible for spheroidenone biosynthesis are no longer considered essential.
In total, by integrating the Tn-seq data with iRsp1140, 10 reactions were deleted, while 21 reactions, 12 metabolites, and 5 genes were added. Reaction reversibility was modified for 17 reactions, and GPRs were modified for 87 reactions (see Table S6 for a complete list of changes). These changes resulted in a new instance of iRsp1140, called iRsp1140_opt, which showed increased overlap with genes that are identified as essential by Tn-seq (Fig. 7). For aerobic conditions, the number of overlapping genes increased from 132 (36%) to 220 (65%) (P value ϭ 3.7 ϫ 10 Ϫ123 , hypergeometric test), while the number of overlapping genes increased from 144 (37%) to 225 (61%) (P value ϭ 3.2 ϫ 10 Ϫ110 , hypergeometric test) for photosynthetic conditions. In sum, these refinements highlight the utility of comparing genome-scale data sets to metabolic network reconstructions as a way to improve metabolic network models.

DISCUSSION
In this study, we show how genetic, genomic, and computational approaches can be used to increase our knowledge of the roles of individual genes under different growth conditions. New advances and insights derived from this work are discussed below.
Software for analyzing Tn-seq data. In recent years, Tn-seq studies have identified essential genes in a wide variety of bacteria, using different experimental and computational approaches (10-13, 27, 47-49). Here, we introduced a new Tn-seq data analysis package, TSAS, which is approachable (requiring no additional programs [aside from Bowtie] and minimal command line interaction) and flexible (able to handle both oneand two-sample analyses) and provides the users with tools to perform a statistically rigorous analysis of their data. In this paper, we have highlighted the use of TSAS to identify essential and conditionally essential genes using one-and two-sample analyses, respectively. While not explored in-depth here, TSAS is also capable of uncovering genes with positive or negative fitness contributions. Aside from those genes considered essential, genes with fewer insertions than expected (one-sample analysis) or that undergo negative selection after outgrowth (fewer reads, two-sample analysis) represent genes that contribute in a positive manner to the fitness of the organism. Conversely, genes with more insertions than expected (one-sample analysis) or that undergo positive selection after outgrowth (more reads, two-sample analysis) represent those that contribute in a negative manner to the fitness of the organism. Researchers may find either set of genes useful during strain design (for a particular phenotype) or when insertion of foreign DNA into a nonnecessary part of the genome is required (insertion of a synthetic operon, for example).
Identifying essential genes with TSAS. We used TSAS to identify, for the first time, genes that are essential for aerobic growth of R. sphaeroides on LB. With 429 of the 489 protein-encoding genes identified as essential in R. sphaeroides having a homolog in the DEG database, we propose that there is high confidence in the lists of essential R. sphaeroides genes. When we compared the 489 genes to 522 genes of Rhodopseudomonas palustris (a closely related purple nonsulfur bacterium) recently reported to be essential for growth in rich medium (9), there were 322 in common. In a comparison of the 489 R. sphaeroides genes identified as essential in a rich medium with the 480 essential genes in the alphaproteobacterium Caulobacter crescentus (50), there is an overlap of 326 genes. As expected, the majority of the core essential genes in these three species encode proteins involved in conserved housekeeping functions like cell division, DNA replication, transcription, and translation.
We also applied TSAS to the R. sphaeroides Tn5 library grown aerobically in SMM to test its ability to identify genes that are required for growth in a minimal medium. Not surprisingly, our analysis revealed that many of the conditionally essential genes for aerobic growth in SMM are known or predicted to synthesize the amino acids, purines, pyrimidines, and other cellular constituents that are absent in this medium (Table S3). We also showed that deletion of RSP_1566, one of the four previously uncharacterized hypothetical proteins identified as essential for aerobic growth in SMM, was lethal under these conditions (Fig. 5). We are currently testing the role of other hypothetical proteins in aerobic conditions in SMM.
We also tested the utility of this approach in identifying the role of genes in photosynthetic growth. As expected, many of the genes identified by TSAS to be conditionally essential for photosynthetic growth mapped to the PGC (Fig. 6), a genomic region containing photosynthesis-related genes in R. sphaeroides and related bacteria (51)(52)(53)(54). These essential genes include many of the bacteriochlorophyll biosynthesis (bch) genes and genes encoding pigment-binding proteins of the photosynthetic apparatus (puhA and pufLM, encoding reaction center subunits; pufBA, encoding subunits of the B870 light-harvesting I complex; and RSP_0290, encoding a lightharvesting I [B870] complex assembly protein) that have been shown previously to be required for photosynthetic growth (52).
Our analysis supports previous conclusions that not all genes in the PGC are required for photosynthetic growth. For instance, a previous study using localized mutagenesis of the R. sphaeroides PGC recovered photosynthetically competent strains containing insertions in all of the carotenoid biosynthesis (crt) genes with the exception of crtF (52). It was not surprising that our Tn-seq data showed that none of the crt genes were essential since carotenoids are accessory pigments not directly involved in light energy capture (55). The gene encoding the PufX protein (RSP_0255) in the PGC has been shown to be required for photosynthetic growth (56,57), though photosynthetic growth can restored in pufX mutants with second-site mutations in the light-harvesting complex B875 (58,59). Our data classified pufX as nonessential, and while such discrepancies may be explained by secondary mutations, differences in light intensities or methods used to score photosynthetic growth, or statistical cutoffs used for TSAS analysis, they illustrate the need to confirm the findings made by analyzing Tn-seq libraries. In this regard, our analysis predicts a role for several newly identified or unexpected genes in photosynthetic growth, including four hypothetical genes and others encoding homologs of glycogen synthase and glutamate racemase enzymes. Future analysis of these genes could identify new functions that are important for the photosynthetic lifestyle of R. sphaeroides and related photosynthetic bacteria.
Comparison of essential genes to those predicted by GEMs. The ability of TSAS to correctly identify genes that are essential for growth motivated us to compare these genes to those predicted by the R. sphaeroides GEM. When we compared the essential genes identified by TSAS with those predicted by iRsp1140, we found a high degree of agreement between the two data sets, which was improved by refining the metabolic model using the Tn-seq data (Fig. 7). We identified genes called essential by iRsp1140 or TSAS alone and made targeted changes to iRsp1140 to reconcile these differences. By modifying GPR rules and biomass composition, adding or deleting reactions, modifying reaction reversibility, or turning off certain reactions in iRsp1140 under defined growth conditions, we were able to achieve better agreement between the metabolic model and the Tn-seq data sets. To our knowledge, this is one of the few comparisons between essential gene predictions by flux balance analysis (FBA) and analysis of a large-scale transposon mutant library (15,16).
Of the remaining genes for which TSAS and iRsp1140 disagree, many of these differences likely reflect the limitations of metabolic models. For instance, several tricarboxylic acid (TCA) cycle genes are identified as essential by TSAS, but iRsp1140 can cycle metabolites through amino acid and nucleotide synthesis pathways to produce the needed intermediates. However, iRsp1140 is likely making erroneous predictions since flux through nucleotide and amino acid synthesis reactions would be low compared to that through central carbon metabolism. While iRsp1140 predicts lower growth rates when these nonessential TCA cycle genes are deleted, the reduction is only 2 to 5% lower than the optimal growth rate. By further constraining maximum fluxes through nucleotide and amino acid synthesis, iRsp1140 would predict lower growth rates. In addition, ATP synthase genes are considered nonessential in iRsp1140 but are identified as essential by TSAS. While growth in iRsp1140 is only about 10% of optimum when ATP synthase genes are deleted, the model predicts that cells acquire enough energy through substrate-level phosphorylation to maintain growth, albeit at a reduced rate.
For some genes identified as essential with TSAS but not with iRsp1140, it is likely that they have additional functions or broad metabolic functions. One example is a SpoT/ RelA family protein (RSP_1670) that is associated with a nonessential reaction in iRsp1140. It is expected, however, that this gene plays an essential role in transcriptional regulation (60,61). Furthermore, some genes, such as RSP_2779 (catA, catalase), are involved in quenching reactive oxygen species. In the future, accounting for stress response and regulatory mechanisms to achieve more holistic biological models should further increase the agreement of gene essentiality predictions with Tn-seq data.
Additionally, TSAS shows that R. sphaeroides may have preferred pathways for synthesizing essential biomass components such as nucleotides. While known alternative nucleotide synthesis pathways exist, and R. sphaeroides contains some genes for these alternative pathways, Tn-seq data show that specific pathways are dominant. It should be noted that R. sphaeroides may be able to utilize these alternate pathways at a sufficient capacity to support growth under specific conditions.
Comparisons between TSAS and iRsp1140 provided some feedback on the statistical cutoffs used in the TSAS analyses, as well as highlighting the limitations of a statistically rigorous approach to calling essentiality. For example, there were 76 genes predicted only by iRsp1140 to be essential for aerobic growth in SMM. Of these, 37 genes contained fewer than four transposon insertions and yet were not called essential in our TSAS analysis. Eleven of these 37 genes would have been called essential with a less-stringent adjusted P value cutoff in TSAS, but the remaining 26 genes provide some insight into the limitations of one-sample analyses in reliably calling genes of small size as essential. In these cases, the small size of the genes and the number of insertions in the library prevent them being called essential, regardless of the statistical cutoff.
In sum, we have shown that the knowledge acquired by statistical analysis of Tn-seq libraries can have many uses. Analysis of genome-scale mutant libraries, when coupled with metabolic modeling, high-throughput phenotyping, and other studies that probe gene-function relationships, should provide a way to identify important growth capabilities and the essential underlying genes.

MATERIALS AND METHODS
Bacterial strains and growth conditions. R. sphaeroides 2.4.1 was used in this study. WM6439 is an auxotrophic E. coli strain requiring 2,6-diaminopimelic acid (DAP) for growth. R. sphaeroides was grown at 30°C in SMM (62) and supplemented with 25 g/ml kanamycin, when required. WM6439 was grown at 37°C in LB supplemented with 50 g/ml kanamycin and 0.1 mM DAP.
Transposon library construction. R. sphaeroides was mutagenized by Tn5 after conjugation with E. coli WM6439 containing plasmid pRL27 (63). Briefly, E. coli WM6439(pRL27) and R. sphaeroides 2.4.1 cells in mid-exponential phase were mixed 1:1, pelleted, resuspended in LB supplemented with 0.1 mM DAP, spotted onto LB plates containing 0.1 mM DAP, and incubated for 3 h at 30°C. Cells were recovered from plates with liquid LB, pelleted, resuspended in LB with glycerol (25% [vol/vol] final concentration), frozen in a dry ice-ethanol bath, and stored at Ϫ80°C. The mixtures from two independent matings were thawed, plated on LB agar supplemented with 25 g/ml kanamycin, and incubated at 30°C for 7 days. Kanamycin-resistant R. sphaeroides colonies were scraped from the plates, resuspended in LB supplemented with 25 g/ml kanamycin, mixed with glycerol to a final concentration of 25% (vol/vol),

Tn-seq in Rhodobacter sphaeroides
May/June 2017 Volume 2 Issue 3 e00015-17 msystems.asm.org 13 aliquoted, frozen in a dry ice-ethanol bath, and stored at Ϫ80°C. The final concentration of cells in the mutant library is 1.2 ϫ 10 10 CFU/ml. Transposon library propagation. For outgrowth experiments, 50 l of the frozen library was used to inoculate cultures containing 500 ml SMM supplemented with 25 g/ml kanamycin. Aerobic cultures were sparged with 69% N 2 , 30% O 2 , and 1% CO 2 . Photosynthetic cultures were sparged with 99% N 2 and 1% CO 2 . Cultures were incubated for approximately seven doublings, as monitored by OD 600 . Cells were collected by centrifugation for 10 min at 5,000 ϫ g, washed with 25 ml phosphate-buffered saline (PBS), collected by centrifugation for 5 min at 5000 ϫ g, resuspended in 2 ml PBS, mixed with glycerol to a final concentration of 25% (vol/vol), frozen, and stored at Ϫ80°C.
Tn-seq sample preparation. Transposon-genome junctions were recovered using a modified version of the Tn-seq circle method (27). Briefly, genomic DNA was isolated using a DNeasy Blood and Tissue kit (Qiagen), including the RNase A digestion step, and sheared to an average size of 400 bp using a Misonix S4000 sonicator with cup horn 431C and the following settings: 50% amplitude and 6 periods of 30 s on/off. Sheared DNA was purified and concentrated using a QIAquick PCR purification kit (Qiagen). Five micrograms of sheared DNA for each sample was used as input for the following enzymatic steps. End repair, A-tailing, and adapter ligation were performed using the Kapa DNA library preparation kit for Illumina (Kapa Biosystems). Illumina-compatible NEXTflex PCR-free barcode adapters were purchased from Bioo Scientific. Enzymatic cleanups between steps were performed with Agencourt AMPure XP magnetic beads (Beckman Coulter, Inc.). Adapter-ligated DNA was digested with HindIII-HF (New England Biolabs) and dual size selected (0.6ϫ and 0.8ϫ cuts) using AMPure XP magnetic beads (Beckman Coulter, Inc.). Circularization and exonuclease treatment followed previously described procedures (27), using Collector_ oligo. DNA was amplified with Kapa Hi-Fi polymerase (Kapa Biosystems) according to the manufacturer's protocol for 24 to 26 cycles (SLXA_FOR_AMP2 and index-appropriate REV_AMP primers). DNA was quantified using an Illumina ABI Prism library quantification kit Kapa Biosystems) on a 7500 real-time PCR system (Applied Biosystems). Libraries were sequenced on an Illumina HiSeq2000 sequencer (100 bp, single read) at the University of Wisconsin Biotechnology Center using Tn-seq_Illumina_primer. See Table S7 in the supplemental material for primer sequences.
Tn-seq data analysis. Illumina sequencing data were mapped to the R. sphaeroides 2.4.1 genome (RefSeq assembly accession no. GCF_000012905.2, Assembly:ASM1290v2) using Bowtie version 1.0.0 (64) with default parameters, except that 50 bases were trimmed from the low-quality (right) end of each read before alignment and only one read was reported (at random) for reads with more than one reportable alignment.
TSAS requires organism-specific data files to process data from a Tn-seq experiment. The following three files are required to run TSAS: (i) aligned read file (i.e., output from short-read alignment software, which maps reads from sequencing experiments to the reference genome; Bowtie, SOAP or Eland result formatted files are accepted), (ii) genome sequence of the target organism in FASTA format, and (iii) genomic coordinates in GFF v3 format. These files are used to assign reads and insertion sites to the genes in which they occur and provide the basis for downstream statistical analysis.
Some common issues with data collected for Tn-seq analysis include bias from library construction, PCR amplification, and nonrandom transposon insertion. To correct for some of these biases, TSAS provides two options. The first is read capping. PCR amplification during sample preparation could exaggerate the number of reads at specific locations. Similarly, hot spots for transposon insertion (if any) could also result in inflated read counts. To correct for these potential biases in the data, we implement a read capping routine that places an upper bound on the allowable number of reads permitted at a given insertion site. Three options are provided to the user: (i) mean of the reads per unique insertion across the genome plus 2 standard deviations, (ii) mean of the reads per unique insertion across the genome, and (iii) median of the reads per unique insertion across the genome. The first option in read capping is implemented by default. The second TSAS option is read weighting: the use of NGS for Tn-seq analysis means that a wealth of DNA sequencing reads is generated for each experiment. While the observed number of DNA sequencing reads per gene is likely anticorrelated with its importance under a given condition, we postulate that the number of insertions per gene is a more important metric for assessing essentiality. Thus, to emphasize the importance of insertions in gene essentiality estimations, while still leveraging the abundance of read count data obtained from NGS, we implemented an additional correction routine that gives greater weight due to the presence of larger numbers of unique insertions: total no. of reads (i) ϫ normalized unique insertions (i)/average normalized unique insertions per gene, where total no. of reads (i) is the total number of reads for gene i, normalized unique insertions (i) is the total number of unique insertion for gene i divided by the length of gene i, and average normalized unique insertions per gene is defined as where n is the total number of genes in the genome. The correction essentially results in an increase in number of reads per gene for genes whose number of normalized insertions is greater than the average and vice versa, weighting the observed reads based on insertions.
In addition to enabling "read capping" and "read weighting," TSAS provides the user an option of setting thresholds on the number of reads mapping to a given insertion site required for that insertion site to be considered for downstream analysis (min_hits), as well as the percentage of the 5= and 3= ends of the genes that may be omitted during statistical analysis as they may not contribute to gene function. For the analyses presented here, 10% of the 5= and 3= ends of genes were excluded from analysis and the minimum hit threshold was set to 10 mapped reads. We allowed four insertions per gene in our essential genes, which provided a buffer against insertions that fell inside the 10% excluded from the 5= and 3= ends.
To make calls on gene essentiality, the summarized gene-level read and insertion counts are statistically evaluated in TSAS using either a one-or a two-sample analysis.
(i) One-sample analysis. When only one condition is available for analysis, a one-sample analysis is run. For this analysis, the observed number of unique insertions is compared to a binomial distribution to determine the probability (P) that a gene is essential: where n i is the length of gene i in base pairs, k i is the number of observed unique insertions in gene i, and p is the probability of insertion across the entire genome (i.e., total number of unique insertions divided by genome size in base pairs). Only the lower tail probability (i.e., P[X Յ x]) is calculated to obtain a P value for essentiality. The P value for improved fitness (i.e., P[X Ն x] or 1 Ϫ P[X Յ x]) is determined to identify genes that may be disadvantageous for growth under the condition of interest. The current implementation assumes a random insertion of the transposon across the genome with no sequence preference and thus may not be suitable for Tn-seq data from site-specific transposons, for which gene-specific probabilities also have to be considered. Use of the binomial distribution helps reduce false-positive essentiality calls for small genes. For instance, if the insertion density of the library is 1 every 500 bp, a 100-bp gene will have a very low probability of getting an insertion. Thus, if the absence of intragenic insertions was used as the criterion for essentiality, such small genes would be deemed essential purely by chance.
(ii) Two-sample analysis. When both a reference and a treatment condition are available, the two-sample analysis approach can be employed. For this analysis, the fold change in the total number of reads per gene is calculated as: where FC i is the fold change for gene i in the treatment condition relative to the reference, total reads treatment (i,cap,weight) is the total number of reads counted for gene i (which may be optionally capped and weighted), and total reads reference (i,cap,weight) is the total number of reads counted for gene i (which may be optionally capped and weighted). Normalization of the bias-corrected reads is achieved through division by the total number of reads for each data set. To determine the statistical significance of the calculated fold changes, P values are calculated using (i) Fisher's exact test, (ii) a proportions-based test, and (iii) Student's t test. The third option is implemented only when a sufficient number of replicate samples is provided.
The P values generated from the one-and two-sample analyses are corrected for multiple testing to generate either a family-wise error rate (Bonferroni correction [65][66][67]) or an adjusted P value (Benjamini-Hochberg correction [68]).
All source code required to run TSAS can be obtained from GitHub at https://github.com/srimam/ TSAS. A detailed description of how to use TSAS and interpret the results generated is provided in the TSAS User Guide (Text S1).
Mutant construction. An in-frame, markerless deletion of RSP_1566 was constructed using the suicide vector pK18mobsacB (69). Briefly, RSP_1566 plus~1-kbp flanking DNA up-and downstream of RSP_1566 was amplified from R. sphaeroides genomic DNA with primers g1566_F_XbaI and g1566_R_HindIII. The amplified products were digested with XbaI and HindIII restriction enzymes (New England Biolabs) and ligated into pK18mobsacB to generate pK18mobsacB_genomicRSP1566. The RSP_1566 coding sequence was deleted from pK18mobsacB_genomicRSP1566 by PCR with phosphorylated primers RSP_1566_deletionF and RSP_1566_deletionR. See Table S7 for primer sequences. The resulting PCR product was ligated to generate pK18mobsacB_ΔRSP1566 and transformed into E. coli DH5␣. pK18mobsacB_ΔRSP1566 was mobilized into R. sphaeroides via conjugation with E. coli S17-1. Transconjugants were selected on LB kanamycin plates under aerobic conditions. Colonies were streaked for purity, resuspended in LB, and plated on LB sucrose (10% [wt/vol]) plates. Isolated colonies were patched onto LB kanamycin and LB sucrose plates to screen for cells that had lost the plasmid. Kanamycin-sensitive, sucrose-resistant colonies containing the desired gene deletion were identified by PCR. Gene deletion was confirmed by Sanger sequencing of the genomic region.
Model comparison. Parsimonious FBA (pFBA) was used to predict essential metabolic genes in R. sphaeroides using the constraint-based model iRsp1140 (18,39). To simulate growth, nutrient uptake rates were set to mimic the composition of SMM, with the free exchange of phosphate, sulfate, magnesium, iron, nicotinate, thiamine, biotin, and CO 2 permitted (i.e., upper and lower bounds of associated exchange reactions were set to 1,000 and Ϫ1,000 mmol g dry weight [gDW] Ϫ1 h Ϫ1 , respectively). The uptake rates for the limiting substrates succinate and ammonium were set to 2.5 and 1 mmol gDW Ϫ1 h Ϫ1 , respectively, based on previous measurements (17). For simulation of aerobic growth, the free exchange of oxygen was permitted, while the photon uptake rate was set to 0 mmol gDW Ϫ1 h Ϫ1 . Conversely, for simulation of anaerobic photosynthetic growth photon uptake was unconstrained, while oxygen uptake was set to 0 mmol gDW Ϫ1 h Ϫ1 . To simulate single-gene deletions under each condition, the gene-protein-reaction (GPR) Boolean logic rules within iRsp1140 were used to determine the upper and lower bounds for the associated reaction(s). For instance, if reaction A had a GPR rule consisting only of gene B, then the upper and lower bounds of reaction A are set to 0 mmol gDW Ϫ1 h Ϫ1 to simulate the impact of deleting gene B. On the other hand, if reaction A is catalyzed by multiple isozymes, the reaction bounds would be unchanged for simulating the deletion of a gene encoding one of those isozymes. Once reaction bounds had been set, FBA was performed as previously described (39). Gene deletions that resulted in a predicted biomass growth rate of 0 h Ϫ1 were considered essential for growth under the conditions tested. pFBA, gene essentiality determination, and reconstruction modifications were made using COBRApy (70). The Python code for modifying the original iRsp1140 SBML file is included in Data Set S1.
Accession number(s). Illumina sequencing data have been deposited in the NCBI Sequence Read Archive (SRA) as BioProject PRJNA376384.