Comparative Genomic Analysis Reveals Habitat-Specific Genes and Regulatory Hubs within the Genus Novosphingobium

This study highlights the significant role of the genetic repertoire of a microorganism in the similarity between Novosphingobium strains. The results suggest that the phylogenetic relationships were mostly influenced by metabolic trait enrichment, which is possibly governed by the microenvironment of each microbe’s respective niche. Using core genome analysis, the enrichment of a certain set of genes specific to a particular habitat was determined, which provided insights on the influence of habitat on the distribution of metabolic traits for Novosphingobium strains. We also identified habitat-specific protein hubs, which suggested delineation of Novosphingobium strains based on their habitat. Examining the available genomes of ecologically diverse bacterial species and analyzing the habitat-specific genes are useful for understanding the distribution and evolution of functional and phylogenetic diversity in the genus Novosphingobium.

The number of contigs/replicons or the number of chromosomes (Chr) and plasmids is shown.
Comparative Genomic Analysis of Novosphingobium spp.
defined previously by DNA-DNA hybridization (DDH) should be reclassified to 59% to 67%. A previous study suggested that GC content is predicted to significantly influence the functional potential and hence ecological adaptation of an organism (20). However, the variability in percent GC content for Novosphingobium was not significant between the four habitats (F 3,15 ϭ 0.308 and P Ͻ 0.82 by ANOVA), suggesting that the ecological adaptations of Novosphingobium spp. are not influenced by a shift in percent GC content. Core genome and pangenome analysis. Bacterial pangenomes typically consist of distinct core and accessory gene complements (21). Novosphingobium maintained a core gene complement of 220 genes (query coverage of Ն75% and nucleotide Identity of Ն75%) for the 27 genomes analyzed. As expected, these orthologs include components of regulatory pathways such as DNA replication, basic transcriptional machinery, translation, mismatch repair, nucleotide excision repair, homologous recombination, signal transduction, bacterial secretion system and protein export. In addition, citric acid cycle, fatty acid biosynthesis and elongation, amino acid biosynthesis and purine metabolism were also present. However, only 128 of the 220 orthologous genes could be reliably annotated as "essential" against the DEG database (22), whereas the remaining 92 accessory genes still coded for basic metabolic functions.
Pangenome analysis of the 27 Novosphingobium strains ( Fig. 1) identified 21,915 nonredundant (nonrepetitive) genes in the pangenome, out of 128,647 total genes. The genome curve displayed an asymptotic trend, indicating that 27 genomes were insufficient to describe the complete gene repertoire of the genus Novosphingobium. Analysis of the core genome was also asymptotic, with 714 core genes after the addition of the 27th genome; however, this trend suggests that further Novosphingobium genomes will result in only minor changes in the core genome of this genus (Fig. 1).
Habitat-specific traits. The orthologous gene contents for Novosphingobium strains in four habitats were identified, and a pairwise comparison was performed to obtain habitat-specific genes. Out of 17,976 redundant orthologous genes, 1,943 gene sets were core genome for rhizosphere, 1,530 for contaminated soil, 1,485 for freshwater, and 1,546 for marine water. Further, comparison of the core genome of each habitat with respect to another revealed the presence of 438 specific genes for rhizosphere, 346 for contaminated soil, 143 for marine water, and 297 for freshwater. These habitat-specific genes were annotated against the KAAS server (23), but only 211 rhizospheric, 125 contaminated soil, 54 marine, and 150 freshwater genes could be annotated with a KEGG Orthology (KO) identifier. These KO identifiers were mapped against metabolic pathways using iPath (24), and the differences were mostly observed in amino acid metabolism, suggesting different amino acid availabilities in these environments (see Fig. S1 in the supplemental material). Rhizosphere-specific gene content consists of genes encoding components involved in glycine, serine, and threonine metabolism. Contaminated-soil-specific gene content consists of genes encoding components involved in tyrosine and phenylalanine metabolism. Freshwaterspecific gene content contain genes encoding components involved in alanine, aspartate, and glutamate metabolism, and marine water-specific gene content contain genes encoding components involved in the bacterial chemotactic regulatory pathway, which could be involved in nutrient acquisition in this normally oligotrophic environment. Genes related to terpenoid backbone biosynthesis were present only in the core genomes of rhizospheric strains, which has been shown to play a role in the stability of bacterial cell membranes and root interaction in rhizospheric strains (25). Therefore, the analysis has put forward the differences between Novosphingobium strains based on differences in the metabolic preferences for amino acids in their respective habitats, representing the resultant adaptive changes in response to the environment.
Distribution of Novosphingobium strains along their phylogenetic clade. The consensus phylogeny of Novosphingobium spp. has shown the mixed trend of phylogenetic clustering of strains isolated from a similar environment. For instance, N. bar-chamii strain LL02 (contaminated soil), Novosphingobium sp. strain P6W (rhizosphere), and Novosphingobium sp. strain AP12 (rhizosphere), despite belonging to different environments, clustered together. While Novosphingobium sp. strain ST904 and N. lindaniclasticum LE124, which were both isolated from contaminated soil, form a monophyletic clade ( Fig. 2 and Table 1). Notably, strains LL02 (13) and LE124 (3) were isolated from hexachlorocyclohexane (HCH) dumpsites, but in all three methods (conserved marker genes and average nucleotide identity [ANI] on the whole genome and core Pangenome. The x axis shows the number of genomes added, and the y axis shows the increase in pangenomic content of Novosphingobium spp. with the addition of genomes. The sizes of the core and pangenome clusters were computed using the BDBH algorithm. For the robustness of the calculation, the built-in program runs the sampling experiments (n ϭ 10), where genomes are randomly added to estimate the stability of the core and pangenome. The best-fit Tettelin curve represents the regression line for the core and pangenome. genome), these strains clustered separately. Similarly, Novosphingobium sp. strain KN65.2 was isolated from carbofuran-contaminated soil but clustered with marine isolates, Novosphingobium sp. strain PP1Y (6) and N. pentaromaticivorans US6-1 (10). This clustering is likely a result of shared metabolic tendency, as strain KN65.2 can degrade carbofuran (2) and strains PP1Y and US6-1 can degrade polyaromatic hydrocarbon (PAH) compounds (6,10). Further ambiguity in habitat specificity was observed from the clustering of strains of marine, contaminated soil, and freshwater habitats (N. ). The results indicated that the phylogenetic clustering of genomes was apparently different from the habitat-specific grouping of these strains. This may be because Novosphingobium spp. have varied metabolic preferences, suggesting that habitat-specific factors are probably masked by the microenvironment in shaping the Novosphingobium genomes. Also, the differences in tree topology using these two methods, i.e., ANI (whole genome based) and 400 conserved bacterial marker genes, could be due to the inclusion of pangenomic content in the case of the whole genome (ANI) rather than the conserved marker genes. Further, to check the impact of the missing gene content from draft genomes, the phylogeny was constructed on the core genome using ANI. The result suggested that the least complete genome (Х93.46% [ Table 1]), i.e., Novosphingobium sp. strain ST904 grouped with N. lindaniclasticum LE124 by all three methods. Thus, it can be inferred that the missing gene content will have the least impact on the change in phylogeny among the Novosphingobium strains.
Habitat-specific protein identification and their protein-protein interaction analysis. The phylogenomics of the different strains did not reflect their habitat specificity, which suggests that the functional repertoire of these strains may supersede evolutionary relatedness. Protein-protein interaction (PPI) networks enable biological characteristics and protein function to be taken into consideration for each strain (26) and can be used to identify habitat-specific adaptations (27). To confirm that the proteome interaction with the environment, particularly for the uptake and secretion of molecules, is highly habitat specific, we aimed for the identification of putative outer membrane proteins involved in the transport of metabolites and toxins, as well as membrane biogenesis (28). We focused on proteins characterized as trans-membrane beta-barrel proteins (TMBbps) in Novosphingobium proteomes. The analysis showed the presence of different numbers of TMBbps in each strain of Novosphingobium across the four habitats. The identified TMBbp sequences of different strains clustered together based on habitat, when subjected to protein sequence similarity analysis. The proteins with the highest percentage of similarity were further referred to as habitat-specific proteins (HSPs). To validate their specificity toward the habitat, amino acid sequences of these TMBbps were subjected to phylogenetic analysis, which demonstrated habitatspecific clustering (Fig. S2). To confirm the stability of these proteins as key regulatory molecules, PPI interaction networks were established based on the core genome. To identify the key molecules, networks for each habitat were constructed and analyzed ( Fig. 3A to D). The hub proteins for each strain in all four habitats were identified (Table S1). To understand the topological properties of these networks, the probability of degree distribution P(k) showed that each network followed a power law scaling behavior with the values of the degree exponent ␥ were ϳ0.52, 1.0, 0.43, and 0.59 in freshwater, marine water, rhizosphere, and contaminated soil habitats, respectively (Fig. 4A). The small value of ␥ (␥ Ͻ 2) indicated that the network was hierarchical (29), signifying the emergence of hierarchical modules and/or communities (30), with a sparse distribution of highly connected hubs (31). The fact that these few highly connected hubs were connected to many low-degree nodes was indicative of a regulatory power of the hubs over these nodes. For further analysis of this topological feature of the network (30), the average clustering coefficient C(k n ) was calculated as a function of the number of neighbors k n : Again, this followed the power scaling law with ␤ values of ϳ0.31, 0.40, 0.73, and 0.36 in freshwater, marine water, rhizosphere, and contaminated soil habitats, respectively, which supported that the network falls in a hierarchical network (Fig. 4B).
The average neighborhood connectivity C n (k n ) was constructed as a function of k n as follows: with values of~0.42, 0.24, 0.36, and 0.33 in freshwater, marine water, rhizosphere, and soil habitats, respectively ( Fig. 4C), also indicating that the network falls in a hierarchical network (30,31), the hub proteins in each habitat network are likely indicative of key molecules for habitat adaptation in each genome (32), and these proteins had the highest degree of interactions in these hierarchical networks. Hub proteins of each habitat were identified, and these proteins include the Saro_1868 protein (TonBdependent receptor) for the freshwater habitat (Fig. 3A), PP1Y_AT17644 protein (hypothetical protein with porin domain) for the marine habitat (Fig. 3B), PMI02_00367 protein (TonB-dependent receptor) for the rhizosphere habitat (Fig. 3C), and V474_17210 protein (TonB-dependent receptor) for the soil habitat (Fig. 3D). As these ␤-barrel outer membrane proteins are present on the surfaces of Gram-negative bacteria and perform a variety of functions such as active ion transport, passive nutrient uptake, membrane anchors, membrane-bound enzymes, and in defense (33), they are likely crucial for the adaptation of the Novosphingobium strains in their respective environments.
Sulfur uptake and metabolism are different between habitats. The sulfur metabolism pathway in prokaryotes involves the uptake and utilization of environmental sulfur derivatives for the synthesis of proteins, sulfate esters of polysaccharides, phenols, steroids, and coenzymes. In general, there are three different routes for the assimilation of environmental sulfur (Fig. 5). The first and predominant mode includes the uptake and metabolism of sulfates in the form of inorganic sulfur (sulfates and thiosulfates) which is carried out by proteins encoded by cysPAUW (transport system) (34) and cysD and cysNC (activation and utilization) (35) followed by cysteine biosyn-Comparative Genomic Analysis of Novosphingobium spp.

FIG 3
The protein-protein interaction (PPI) network of four habitats, i.e., freshwater, marine water, rhizosphere, and soil. Expanded view of the network imported from Cytoscape, where nodes represent proteins and edges represent physical interactions. The nodes in all four habitats (freshwater, marine water, rhizosphere, and contaminated soil) were represented as filled circles that were light red, green, dark blue, and light blue, respectively. The edges in all habitats were represented in the form of grey lines. The significant existence of sparsely distributed hubs in four habitat networks were represented by colored circles as purple (freshwater), dark blue (marine), orange (rhizosphere), and pink (contaminated soil). thesis genes cysE, cysK, and cysQ. The second route involves the uptake and utilization of environmental sulfonates, characterized by the presence of the ssuABC (transport system) and ssuD (FMNH 2 -dependent alkane sulfonate monooxygenase) genes. The alkane sulfonates comprise the major portion of carbon-bonded environmental sulfur (68%) (36) and 20 to 40% of organic sulfur present near marine sediments (37). The third route of sulfur assimilation involves taurine transport and metabolism encoded by the tauABC (transport system) and tauD (taurine dioxygenase) genes, respectively.
Studies related to sulfur assimilation in bacteria isolated from different habitats have revealed the coexistence of these routes in the same species (38), but to date, no study has determined the distribution of these three pathways across different habitats. To determine this for Novosphingobium in rhizosphere, contaminated soil, freshwater, and marine water, the genes involved in sulfur metabolism were identified and strains were clustered according to their sulfur assimilation repertoire. Four resultant clades were designated: clade I, clade II, clade III, and clade IV (Fig. 6). Although clustering of the strains based on habitat was not observed, the pattern of differentiation of pathways was clearly demarcated. For instance, sulfate metabolism, the most predominant mode of environmental sulfur assimilation, was found only in clade I (strains MBES04, LE124, FNE08-7, and AAP93) and clade II (strains ST904, AP12, P6W, LL02, and NBRC15208) (Fig. 6). Further, the complete pathway of alkane sulfonate assimilation was found exclusively in strains clustered in clade II, which comprised only soil isolates (rhizosphere and contaminated soil). Earlier, the alkane sulfonate assimilation system had been reported in freshwater isolates (38), but none of the freshwater isolates we studied maintained the system. In addition to this, tauD coding for taurine dioxygenase was identified in all of the Novosphingobium strains, while the taurine transport system was absent. The two other clades, clades III (comprised of mainly aquatic isolates) and IV, lacked a complete sulfur transport system, instead maintaining a mosaic of genes encoding components involved in sulfate oxidation, taurine oxidation, and sulfonate oxidation, which suggests the use of multiple sulfur derivatives. Interestingly, the strains isolated from contaminated soil were found in all four clades and therefore maintained a diverse array of sulfate metabolism. This suggests that the modes of sulfur assimilation in Novosphingobium spp. were not confined to a certain habitat but might relate to the availability of different types of environmental sulfur compounds in their respective habitats.
Mechanisms for survival in marine environments are also observed in contaminated soils. In general, there are two different strategies that are known to confer bacterial survival in a saline environment. These strategies include accumulation of inorganic components in the cytoplasm, which counterbalances the salinity (39), and synthesis of the organic osmolytes that do not increase the ionic concentration but maintain the osmotic pressure (40). Two such osmotic solutes are ectoine (1,4,5,6tetrahydro-2-methyl-4-pyrimidine carboxylic acid) and hydroxyectoine, which are common osmolytes in marine and halotrophic bacteria (41)(42)(43). The ectoine biosynthesis pathway involves components encoded by the ectA (L-2,4-diaminobutyric acid acetyltransferase), ectB (L-2,4-diaminobutyric acid transaminase), and ectC (L-ectoine synthase) genes (44). In addition to this, the protein encoded by ectD (ectoine hydroxylase) catalyzes the conversion of ectoine into hydroxyectoine (45).
Ectoine biosynthesis is considered to be an adaptation of marine Novosphingobium strains, such as Novosphingobium sp. strain PP1Y and N. pentaromaticivorans US6-1, which were previously reported to possess the ectoine biosynthesis pathway (19). However, we found that among the marine isolates, only N. malaysiense Musc273 along with PP1Y and US6-1 maintained a complete ectoine biosynthesis pathway, while two other marine isolates, N. subterraneum DSM12447 and Novosphingobium sp. strain MBES04, did not possess any of the ectoine pathway genes. The complete absence of the ectoine pathway in marine strains MBES04 and DSM12447 suggested that these strains might use different routes to compensate for high-salt conditions of marine water. Another possible reason might be that both strains are not truly marine, as the former was isolated from sunken wood (5) while the latter was isolated from coastal plains at a depth of 180 m (unpublished). Interestingly, strains isolated from other habitats were found to possess genes for ectoine biosynthesis, such as Novosphingobium sp. KN65.2, a carbofuran-contaminated soil isolate, which possessed the complete ectoine biosynthesis pathway. In addition to this, ectA and ectB were identified in N. barchamii LL02 and Novosphingobium sp. ST904, isolated from hexachlorocyclohexane-and 2,4,6-trinitrotoluene-contaminated soil, respectively. Also, rhizospheric strains, Novosphingobium sp. P6W and N. rosa NBRC 15208 were found to possess ectA and ectB, respectively, while freshwater strains were completely devoid of genes for ectoine biosynthesis. The occurrence of ectoine pathway genes in strains from contaminated soil and rhizosphere habitats implies that ectoine synthesis may not be a habitat-specific trait but it may instead be acquired and maintained by strains from different ecological niches, likely driven by environmental stress, or that the pathway is not useful but simply maintained in the contaminated soil environment.
Degradation potential of Novosphingobium strains across four different habitats. Sphingomonads have been widely reported as efficient degraders of xenobiotic compounds such as hexachlorocyclohexane, chlorophenol, phenol, homogentisate, anthranilate, and other polyaromatic hydrocarbons (46,47). Of the sphingomonads, Sphingobium and Sphingomonas strains have been extensively studied with respect to their xenobiotic degradation potential (12,48), while less is known about Novosphingobium spp. A comparative genomic study on six Novosphingobium strains was carried out earlier (19), but the focus was on overall genomic repertoire. Here we analyzed Novosphingobium genomes for the presence of aromatic compound degradation pathway genes. The analysis revealed that the genes encoding PAH and components involved in xenobiotic degradation were enriched in Novosphingobium strains (Fig. 7) among which freshwater strains showed similarity in genes encoding mono-and dioxygenases, with very similar metabolic profiles, while strains from the other three habitats clustered separately ( Fig. 7A and B). Of note, N. rosa NBRC15208, a rhizospheric isolate, was found to harbor the highest number of genes (n ϭ 157) for aromatic compound degradation, especially for gentisate, protocatechuate, and catechol. The two other rhizospheric strains, Novosphingobium sp. P6W (n ϭ 45) and Novosphingobium sp. AP12 (n ϭ 59), contained only 33% of the N. rosa NBRC15208 gene complement. Following this, Novosphingobium sp. KN65.2 (contaminated soil) and Novosphingobium sp. PP1Y (marine) with 124 genes each, had the second highest metabolic repertoire. Novosphingobium sp. KN65.2 possessed genes mainly for gentisate, biphenyl, homogentisate, and protocatechuate degradation, while Novosphingobium sp. PP1Y possessed a high number of gentisate and biphenyl degradation genes. Interestingly, strains from sites contaminated with HCH, polychlorinated dioxin, pulp mill effluent, and carbofuran contained comparably fewer genes for aromatic compound degradation, which suggested that particular contaminants might lead to genome streamlining under environmental stress. Also, genes for gentisate, catechol, and protocatechuate catabolism were found in abundance, projecting their ability to degrade a variety of aromatic compounds (49).
Phage integration and genomic adaptation. Phage/virus integration in bacterial genomes is often considered a genomic adaptation mechanism of bacterial strains which enables novel gene acquisition and might be critical for survival. It has been reported that integrated prophages can constitute up to 20% of a bacterial genome (51), which eventually leads to strain emergence and diversification. Such genomic reservoirs have been shown to be highly diverse across aquatic and terrestrial ecosystems (52). In this study, 29 intact phages in Novosphingobium genomes (Table 2) were identified, with the greatest diversity in strains from contaminated soil (12 phages). Although most of the proteins encoded by the integrated phage were either phage related or hypothetical, a few of the annotated proteins, such as arsenic resistance, NADH-dependent flavin mononucleotide (FMN) reductases, dioxygenase, and permease, could provide improved resistance and degradation of polyaromatic hydrocarbons (PAHs) ( Table 2).
Novosphingobium strains from marine habitats had the second greatest abundance of phage content. This may be due to the fact that viruses are very common in oligotrophic marine environments (53,54). Interestingly, Novosphingobium sp. MBES04 acquired the gene encoding 5-oxoprolinase via phage-mediated horizontal gene transfer, which catalyzes 5-oxoproline conversion into glutamate. Pyroglutamic acid or 5-oxoproline is an osmolyte that helps in the maintenance of osmotic balance along with sucrose and ectoine, predominantly characterized in bacteria inhabiting environments with high salt concentrations (55). Further, studies have also shown the role of glutamate in osmoregulation (56). Although the complete pathway for pyroglutamic acid synthesis was absent, the strain MBES04 might be using an alternative pathway and thus acquiring these features for streamlining the genome with respect to the habitat. Apart from this, marine strains have shown the acquisition of ompA and motB genes (MBES04), generally found in the outer membranes of Gram-negative bacteria (57) and known to influence bacterial attachment (58). Hence, this is predicted to further boost the chemotactic behavior of marine bacteria. Further, the acquisition of phage-mediated transcription initiation factor, elongation factor, and regulators may help in activation of adaptive genes (59). Thus, Novosphingobium strains have shown the well-developed phage acquisition-adaptation machinery that might play an important role in combating stress from the environment they inhabit.
Conclusions. The phylogenetic relationship among Novosphingobium strains was not completely concordant with habitat, as only some strains clustered with strains from similar habitats. The overall genetic repertoire played a significant role in structuring the similarity between strains, suggesting that habitat has little influence on the phylogenetic relationship. However, a systems biology approach revealed habitatspecific protein hubs that were able to delineate Novosphingobium strains based on their habitats. Further, metabolic genes with significant habitat-specific delineation were determined. For instance, sulfur acquisition was differentially encoded among strains and habitats, while the alkane sulfonate assimilation pathway was common among all rhizospheric strains. The ectoine biosynthesis pathway, predominantly identified for osmoregulation in marine bacteria, was also identified in strains isolated from other habitats, suggesting its significance beyond the marine habitat. Aromatic com- Comparative Genomic Analysis of Novosphingobium spp.
pound degradation and abundance of mono-and dioxygenase genes across all strains in all habitats suggest that Novosphingobium represents an untapped resource for the field of biotechnology. Abundance of integrated phage and resultant acquisition of genes that confer stability in their habitat are signs of well-developed phage gene acquisition machinery in Novosphingobium.

MATERIALS AND METHODS
Gene prediction and annotation. Novosphingobium genomes, including both draft and complete genomes, were retrieved from the NCBI database (Table 1). One strain, Novosphingobium lindaniclasticum LE124, was sequenced by our laboratory using an Illumina genome analyzer and 454 GS FLX titanium platform, and reads were assembled into 156 contigs (60). For all of the Novosphingobium strains, genome annotations were carried out using RAST version 2.0 (61) and gene caller Glimmer-3 (62). Orthologs were predicted using the sequence clustering algorithm, COGtriangles (63) available in GET_HOMOLOGUES software package (64) with both identity and query coverage of Ն75%, using amino acid sequences. Further, the presence of essential genes in the orthologs was identified using the Database of Essential Genes (DEG) version 13.3 (22). The genome completeness was estimated by analyzing the presence of 107 essential copy genes using the Comprehensive Microbial Resource as a database, where 107 hidden Markov models (HMMs) of essential copy genes were analyzed in all of the Novosphingobium strains (65).
Pangenomic and core genomic trend analysis. For each genome, amino acid sequences were retrieved from RAST version 2.0 (61) and were used for pangenome and core genome trend analysis using the bidirectional best-hits (BDBH) clustering algorithm (64) at default parameters. Thereafter, the number of genes was plotted against the number of genomes added in the analysis, with Tettelin fitted curve (82).
Phylogenetic analysis. In order to obtain congruency in the phylogeny of Novosphingobium strains, three different methods were used. In the first method, phylogenetic clustering was performed based on protein sequences of 400 marker genes of Novosphingobium strains (66). The maximum likelihood methodology was used for the construction of the phylogenetic tree, using S. indicum B90A as an outgroup. In order to further demarcate the phylogeny of Novosphingobium strains, two other methods based on pairwise average nucleotide identity (ANI) (67) were used; the first method involves pairwise ANI comparison between 220 orthologous genes, and the second method employed whole-genome sequences to account for both core and accessory genome content. Two-way matrices were prepared, and dendrograms were constructed by the Pearson correlation method and hierarchical clustering using MeV (68).
Habitat-specific genes and their metabolic pathways. In order to identify the habitat-specific traits of the genus Novosphingobium, we divided the genomes into four different habitats, rhizosphere, contaminated soil, marine water, and freshwater (Table 1). Strains belonging to these habitats were included for further analysis. The strains isolated from the rhizosphere were Novosphingobium sp. AP12, Novosphingobium sp. P6W, and N. rosa NBRC 15208. The strains isolated from contaminated soil were N. barchamii LL02, N. lindaniclasticum LE124, N. naphthalenivorans NBRC102051, Novosphingobium sp. KN65.2, and Novosphingobium sp. ST904. The strains isolated from freshwater were Novosphingobium sp. AAP1, Novosphingobium sp. AAP83, Novosphingobium sp. AAP93, N. fuchskuhlense FNE08-7, N. aromaticivorans DSM12444, and N. acidiphillum DSM19966. The strains isolated from marine water were Novosphingobium sp. MBES04, N. malaysiense Musc273, N. subterraneum DSM12447, N. pentaromaticivorans US6-1, and Novosphingobium sp. PP1Y. Initially, the core genome content of each habitat was predicted by clustering the genomes with the COGtriangles algorithm (as described above). Then, the core genome of each habitat was compared to identify the cloud content (i.e., genes that were present in Յ2 habitats). Further, habitat-specific genes were retrieved manually, mapped for metabolic pathways using KAAS (23), and visualized using iPATH version 2 (24).
Identification of habitat-specific proteins and their protein-protein interactions. To identify habitat-specific proteins (HSPs), the trans-membrane beta-barrel proteins (TMBbps) (28) were predicted based on the BOMP (Beta-barrel Outer Membrane protein Predictor) program (69). Then, protein sequences of all strains were subjected to TMBbp prediction, and potential proteins were selected for further analysis. All TMBbp sequences of each habitat group were compared using BLASTp, so that the similar proteins could be used for hub identification (70). The TMBbp sequence comparison identified similar sequences present in all of the strains from these four habitats. The topmost sequence is considered a habitat-specific protein (HSP) and subjected to validation using phylogenetic analysis. In order to construct the protein-protein interactions (PPIs), HSP sequences of Novosphingobium strains were searched against the STRING Database (v10) (71). Strains from freshwater and marine water habitats were searched against Novosphingobium aromaticivorans and Novosphingobium sp. strain PPIY, respectively, while the soil and rhizosphere strains sequences were queried against Novosphingobium nitrogenifigens. The STRING v10 database consisted of known and predicted PPIs, which included both direct (physical) and indirect (functional) associations. The associations were integrated with different sources such as genomic context, high-throughput experimental data, database and literature mining, and analysis of coexpressed genes. This allowed an agile exploration of the interactome network and included certain calculated parameters that weighed the reliability of a given interaction (i.e., the "edges" of the interactome network) between two proteins and also qualified the functional environment around any given protein and their interacting partners (i.e., the "nodes" of the interactome network) (72). The PPI networks were visualized using Cytoscape version 3.0.1 (73). The hubs are proteins having a high degree of interactions, randomly placed in the network, and have important functional roles. In the current study, the hubs were identified using network analyzer and Perl programming version 5.18.2.2.
Statistical analysis of the network. The statistical and functional significance of the network was measured using various statistical parameters, namely, probability of degree distribution, average clustering coefficient, and average neighborhood connectivity (74). The degree of probability distribution, P(k), of a network defined by P(k) ϭ n k/N, which is the ratio of the number of nodes having a k degree in the network ( n k) to the size of the network (N), was used to capture the network structure, identification of hubs, and modular organization of the network. The network we constructed obeyed the power law, P(k) ϳ k Ϫ␥ , indicating the scale-free nature of the network, where ␥ is an order parameter that identified the different topological structure of a scale-free network. The clustering coefficient C(k), which is defined by is the ratio of the number of edges E of the node having a k degree with neighbors to the total possible number of such edges, which is a measure of the topological structure of the network (75). The average clustering coefficient C(k) identifies overall organization of formation of clusters in the network. Similar to P(k), C(k) may depend on network size and characterizes various properties of the network: (i) for scale-free and random networks where C(k) is independent of k, C(k) ϳ constant, and (ii) for hierarchical networks where C(k) follows power law scaling behavior, C(k) ϳ k ␤ with ␤ ϳ 1. The neighborhood connectivity of a node is the number of neighbors connected to it and characterizes the correlation pattern of connectivity of interacting nodes in the network. This connectivity correlation would be measured by defining a conditional probability P (k' n |k n ) (6) which is the probability of making a link from a node having degree k n to another node of degree k= n (76). Then, the average neighborhood connectivity of nodes with connectivity k n is given by C n ͑ k n͒ ϭ k' n k' n P ͑ k' n |k n͒~kn Ϫϰ (76) following a power law scaling behavior with ␣ Ͻ 1 for most of the real networks (31,77). If C n (k n ) is an increasing function of k n (for negative values of ␣), then the topology of the network shows assortive mixing (78) where nodes with a high number of edges per node (high-degree nodes) have affinity to connect to other high-degree nodes in the network. However, from equation 3 with positive values for ␣ is the signature of the network having hierarchical structure, where low-degree nodes tend to connect high-degree hubs (78) and few high-degree hubs present in the network try to control the low-degree nodes. Phage and genomic island prediction. Genomes were searched for phage content using the online server PHAST (79). The phage content was then analyzed for the presence of phage-related, hypothetical, and bacterial genes (Table 2). Further, genomic islands (GIs) were predicted using IslandViewer (80).

SUPPLEMENTAL MATERIAL
Supplemental material for this article may be found at https://doi.org/10.1128/ mSystems.00020-17. We acknowledge Phoebe Oldach for helpful suggestions and editing the article.