ABSTRACT
With the exponential increase in the number of bacterial taxa with genome sequence data, a new standardized method to assign species designations is needed that is consistent with classically obtained taxonomic analyses. This is particularly acute for unculturable, obligate intracellular bacteria with which classically defined methods, like DNA-DNA hybridization, cannot be used, such as those in the Rickettsiales. In this study, we generated nucleotide-based core genome alignments for a wide range of genera with classically defined species, as well as those within the Rickettsiales. We created a workflow that uses the length, sequence identity, and phylogenetic relationships inferred from core genome alignments to assign genus and species designations that recapitulate classically obtained results. Using this method, most classically defined bacterial genera have a core genome alignment that is ≥10% of the average input genome length. Both Anaplasma and Neorickettsia fail to meet this criterion, indicating that the taxonomy of these genera should be reexamined. Consistently, genomes from organisms with the same species epithet have ≥96.8% identity of their core genome alignments. Additionally, these core genome alignments can be used to generate phylogenomic trees to identify monophyletic clades that define species and neighbor-network trees to assess recombination across different taxa. By these criteria, Wolbachia organisms are delineated into species different from the currently used supergroup designations, while Rickettsia organisms are delineated into 9 distinct species, compared to the current 27 species. By using core genome alignments to assign taxonomic designations, we aim to provide a high-resolution, robust method to guide bacterial nomenclature that is aligned with classically obtained results.
IMPORTANCE With the increasing availability of genome sequences, we sought to develop and apply a robust, portable, and high-resolution method for the assignment of genera and species designations that can recapitulate classically defined taxonomic designations. Using cutoffs derived from the lengths and sequence identities of core genome alignments along with phylogenetic analyses, we sought to evaluate or reevaluate genus- and species-level designations for diverse taxa, with an emphasis on the order Rickettsiales, where species designations have been applied inconsistently. Our results indicate that the Rickettsia genus has an overabundance of species designations, that the current Anaplasma and Neorickettsia genus designations are both too broad and need to be divided, and that there are clear demarcations of Wolbachia species that do not align precisely with the existing supergroup designations.
INTRODUCTION
While acknowledging the disdain that some scientists have for taxonomy, Stephen Jay Gould frequently highlighted in his writings how classifications arising from a good taxonomy both reflect and direct our thinking, stating, “the way we order reflects the way we think. Historical changes in classification are the fossilized indicators of conceptual revolutions” (1). Historically, bacterial species delimitation relied on phenotypic, morphologic, and chemotaxonomic characterizations (2–4). The 1960s saw the introduction of molecular techniques in bacterial species delimitation through the use of GC content (5), DNA-DNA hybridization (DDH) (6), and 16S rRNA sequencing (7, 8). Currently, databases like SILVA (9) and Greengenes (10) use 16S rRNA sequencing to identify bacteria. However, 16S rRNA sequencing often fails to separate closely related taxa, and its utility for species-level identification is questionable (10–12). Multilocus sequence analysis (MLSA) has also been used to delineate species (13), as has the phylogenetic analysis of both rRNA and protein-coding genes (3, 14, 15). Nongenomic mass spectrometry-based approaches, in which expressed proteins and peptides are characterized, provide complementary data to phenotypic and genomic species delimitations (16, 17) and are used in clinical microbiology laboratories. However, DDH remains the “gold standard” of defining bacterial species (18, 19), despite the intensive labor involved and its inability to be applied to nonculturable organisms. A new genome-based bacterial species definition is attractive, given the increasing availability of bacterial genomes, rapid sequencing improvements with decreasing sequencing costs, and data standards and databases that enable data sharing.
Average nucleotide identity (ANI) and digital DDH (dDDH) were developed as genomic-era tools that allow for bacterial species classification with a high correlation to results obtained using wet lab DDH, while bypassing the associated difficulties (20–22). For ANI calculations, the genome of the query organism is split into 1-kbp fragments, which are then searched against the whole genome of a reference organism. The average sequence identity of all matches having >60% overall sequence identity over >70% of their length is defined as the ANI between the two organisms (18). dDDH uses the sequence similarity of conserved regions between two genomes of interest (23) to calculate genome-to-genome distances. These distances are converted to a dDDH value, which is intended to be analogous to DDH values obtained using traditional laboratory methods (23). There are three formulas for calculating dDDH values between two genomes using either (i) the length of all matches divided by the total genome length, (ii) the sum of all identities found in matches divided by the overall match length, and (iii) the sum of all identities found in matches divided by the total genome length, with the second formula being recommended for assigning species designations for draft genomes (24, 25). However, neither ANI nor dDDH reports the total length of fragments that match the reference genome, and problems arise when only a small number of fragments are unknowingly used. Additionally, neither method generates an alignment that can be used for complementary phylogenetic analyses, instead relying solely on strict cutoffs to define species, potentially leading to the formation of paraphyletic taxa.
Some recent phylogenomic analyses have shifted toward using the core proteome, a concatenated alignment constructed using the amino acid sequences of genes shared between the organisms of interest (26). However, differences in annotation that affect gene calls can add an unnecessary variable when deriving evolutionary relationships. Instead, we propose that such analyses should use nucleotide core genome alignments to infer phylogenetic relationships with a well-defined nucleotide identity threshold to identify where trees should be pruned to define species. Such an analysis would be enabled by genome aligners like Mauve (27) and Mugsy (28), which identify regions of nucleotide identity that are shared between genomes, which are referred to as locally collinear blocks (LCBs). For any given subset of genomes, a core genome alignment can be generated by concatenating these LCBs and retaining only positions present in all genomes. Using the length and sequence identity of the core genome alignment, paired with a phylogeny generated from this alignment, genus- and species-level taxonomic assignments can be developed. A core genome alignment-based method provides advantages to its protein-based counterpart in that it is of a higher resolution and independent of annotation, while also being transparent with respect to the data used in the calculations and very amenable to data sharing and deposition in data repositories.
The Rickettsiales are an order within the Alphaproteobacteria composed of obligate, intracellular bacteria where classic DNA-DNA hybridization is not possible and bacterial taxonomy is uneven, with each of the genera having its own criteria for assigning genus and species designations. Within the Rickettsiales, there are three major families, the Anaplasmataceae, Midichloriaceae, and Rickettsiaceae, with an abundance of genomic data being available for genera within the Anaplasmataceae and Rickettsiaceae. However, as noted above, species definitions in these families are inconsistent, and organisms in the Wolbachia genus lack community-supported species designations. While the Wolbachia community has vigorously discussed nomenclature at each of its biannual meetings over the past 20 years, thus far, the community has supported only supergroup designations. Proposed ANI- and dDDH-informed Wolbachia species definitions and nomenclature (29) lack support from the Wolbachia community for a number of reasons (see, e.g., reference 30).
The last reorganization of the Rickettsiaceae taxonomy occurred in 2001 (31), a time when there were <300 sequenced bacterial genomes (32), a limited number of Rickettsia genomes (33, 34), and no Anaplasmataceae genomes (35–38). As of 2014, there were >14,000 bacterial genomes sequenced, and with this increase in available genomic information, more-informed decisions can be made with regard to taxonomic classification (32). By using core genome alignments, we can condense whole genomes into positions shared between the input genomes and use sequence identity to infer phylogenomic relationships. We can identify sequence identity thresholds for these alignments that are consistent with classically defined bacterial species and can be overlaid on phylogenetic trees generated from the same alignment. This combined approach, which we present in this study, enables a phylogenetically informed whole-genome approach to bacterial taxonomy. We apply this approach to organisms in the Rickettsiales, including Rickettsia, Orientia, Ehrlichia, Neoehrlichia, Anaplasma, Neorickettsia, and Wolbachia.
RESULTS
Advantages of nucleotide alignments over protein alignments for bacterial species analyses.While core protein alignments are increasingly used for phylogenetics (39, 40), a core nucleotide alignment has more phylogenetically informative positions in the absence of substitution saturation (41), yielding a greater potential for phylogenetic signal. Nucleotide-based analyses outperform amino acid-based analyses at all time scales in terms of resolution, branch support, and congruence with independent evidence (42–44). Nucleotide and protein core genome alignments were constructed for 10 complete Wolbachia genomes (see Table S1 in the supplemental material), and the resulting phylogenies were compared.
Table S1
Copyright © 2018 Chung et al.This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
There are a number of whole-genome aligners that can be used to generate nucleotide core genome alignments (for an overview, see reference 45). We generated core genome alignments for the 10 Wolbachia genomes using Mauve and Mugsy, currently two of the most commonly used whole-genome aligners (27, 28). Despite the different algorithms used by each of the two programs, the lengths of the generated core genome alignments are similar, with the Mauve and Mugsy core genome alignments being 682,949 and 579,495 bp in length, respectively. However, because Mugsy has been found to be better at handling larger genome data sets (28), Mugsy was used for all subsequent core genome alignment analyses discussed. Despite the presence of large-scale genome rearrangements present between the 10 Wolbachia genomes (Fig. S1), the resulting alignment is 45.9% of the average input genome length, indicating that collinear genomes are not necessarily required to construct core genome alignments. The lack of synteny results merely in more and smaller local colinear blocks.
FIG S1



Copyright © 2018 Chung et al.This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
Using the same 10 Wolbachia genomes and the available annotation at NCBI RefSeq, a core protein alignment was generated using FastOrtho, a reimplementation of the tool OrthoMCL (78). Despite the presence of different annotation methods used for gene calls, we were able to generate a protein alignment of 77,868 positions from the concatenated alignments of 152 shared genes between the 10 genomes using the available protein amino acid sequences (Table S1). In total, the core protein alignment contained 16,241 parsimony-informative positions compared to the 124,074 such positions observed with the core nucleotide alignment, indicating a 10-fold increase in potentially informative positions (Fig. 1A). When maximum-likelihood (ML) trees from the core protein alignment and the core nucleotide alignment are compared, the two trees are quite similar in topology and branch length. However, the nodes on the ML tree generated using the core nucleotide alignment consistently have higher bootstrap support values.
Comparison of phylogenomic trees generated using the core nucleotide alignments versus protein-based alignments for 10 complete Wolbachia genomes. A maximum-likelihood phylogenomic tree generated from the core genome alignment (CGA) of 10 complete Wolbachia genomes was compared to a core protein alignment (CPA) containing 152 genes present in only one copy (A) and an alignment generated using PhyloPhlAn, containing 176 conserved proteins (B). Wolbachia supergroups A (●), B (▲), C (■), D (), E (
), and F (
) are represented in the genome subset. Shapes of the same color indicate that the multiple genomes are of the same species as defined using our determined CGASI cutoff of ≥96.8%. When comparing the ML trees generated using the core nucleotide and core protein alignments, the trees are largely similar in both topology and branch length. In the comparison between the ML trees generated using the core nucleotide alignment and PhyloPhlAn, the trees are similar except for the relationship of wRi, which is sister to wHa in the core nucleotide alignment tree and sister to wAu + wMel in the PhyloPhlAn tree. Despite differences in clustering, the core nucleotide alignment ML tree consistently has higher bootstrap values than its core protein alignment or PhyloPhlAn counterpart.
Similarly, we compared the Wolbachia core nucleotide alignment to a protein-based alignment generated using PhyloPhlAn (39). PhyloPhlAn uses a set of 400 of the most conserved proteins across diverse bacterial taxa to infer phylogenetic relationships. Across the 10 complete Wolbachia genomes, 176 of these 400 genes were identified to be present in each of the genomes. A concatenated amino acid alignment of these 176 genes yields an alignment with 2,877 positions, of which 2,447 are parsimony informative, indicating a 50-fold decrease in the number of parsimony-informative positions from that of the core nucleotide alignment and an 8-fold decrease relative to that of the core protein alignment. As with the core protein alignment, the ML trees constructed using the Wolbachia core nucleotide alignment and PhyloPhlAn have similar topologies and relative branch lengths, with the core nucleotide alignment having higher support values on its nodes (Fig. 1B). The one difference in clustering occurs within the supergroup A Wolbachia, in which wRi clusters with wHa in the core nucleotide alignment, while wRi clusters with wMel and wAu in the PhyloPhlAn ML tree. The wRi and wHa branch is supported by a bootstrap value of 100 in the core nucleotide alignment ML tree, while the wRi, wMel, and wAu branch is supported by a bootstrap value of 98.9 in the PhyloPhlAn ML tree.
Collectively, these results demonstrate that while nucleotide and protein alignments give roughly the same result, there are key differences. Core nucleotide alignments have almost an order of magnitude more parsimony-informative positions, and this results in stronger branch support values in the corresponding phylogenies.
Can core genome alignments be constructed from bacteria within a genus?To compare differences between existing classically defined species designations by this method of using sequence identity thresholds with whole-genome phylogenies, core genome alignments were constructed for seven bacterial genera representative of six different bacterial taxonomic classes (Table 1). For these seven genera, which include Arcobacter, Caulobacter, Erwinia, Neisseria, Polaribacter, Ralstonia, and Thermus, each of the generated core genome alignments were found to be ≥0.33 Mbp in size, representative of ≥13.6% of the average input genome size (Table 1; Table S2). For this diverse group of genera, the core genome alignments comprise a large portion of their input genomes, indicating that this technique is applicable to a wide range of classically defined bacterial taxa. Additionally, core genome alignments were successfully constructed from the genomes of four of the six Rickettsiales genera, including 69 Rickettsia genomes, 3 Orientia genomes, 16 Ehrlichia genomes, and 23 Wolbachia genomes (Table 1). Each of these four core genome alignments are >0.18 Mbp in length and contain >10% of the average size of the input genomes.
Core genome alignment statistics
Table S2
Copyright © 2018 Chung et al.This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
Substitution saturation.A common caveat in nucleotide alignments is the presence of substitution saturation, an issue in which the phylogenetic distances reported by nucleotide alignments are underestimated due to an overabundance of reverse mutations within the organism subset (41). Compared to protein-based methods, substitution saturation more heavily impacts nucleotide-based phylogenetic distance measurements due to the higher rate of substitutions per position. However, for each core genome alignment, when the uncorrected pairwise genetic distances are plotted against the model-corrected distances, linear relationships are observed for all alignments (r2 > 0.995), indicating that substitution saturation does not hinder the ability of the core genome alignments to represent evolutionary relationships (46) (Fig. S2). Given the fact that the analyzed genera span bacterial taxonomic classes, including Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Flavobacteriia, and Deinococci (Table 1), we expect this result to be broadly applicable to bacterial core genome alignments.
FIG S2
Copyright © 2018 Chung et al.This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
Assessment of genus designations.We found our initial core genome alignments for Anaplasma and Neorickettsia to be considerably shorter than other core genome alignments, both being ∼20 kbp and accounting for ≤2.3% of the average input genome sizes. The small size of the core genome alignment indicates that the Anaplasma and Neorickettsia genome subsets each contain a diverse set of genomes that are likely not of the same genus. Using input genomes from different genera to construct a core genome alignment yields an alignment of an insufficient size to accurately represent the evolutionary distances between the input genomes. For example, when the genome of Neoehrlichia lotoris is supplemented with the genomes used to create the 0.49-Mbp Ehrlichia core genome alignment, the resultant core genome alignment is 0.11 Mbp, representing only 8.9% of the average input genome size, compared to the prior 39.8%. Therefore, we used subsets of species to test whether the Anaplasma and Neorickettsia genera are too broadly defined. A core genome alignment generated using only the 20 A. phagocytophilum genomes in the 30 Anaplasma genome set is 1.25 Mbp and represents 83.3% of the average input genome size, while a core genome alignment of the remaining 10 Anaplasma genomes is 0.77 Mbp, 65.3% of the average genome input size. This result suggests that the Anaplasma genus should be split into two separate genera. Similarly, when the genome of Neorickettsia helminthoeca Oregon is excluded from the Neorickettsia core genome alignment, a 0.77-Mbp Neorickettsia core genome alignment is generated, 87.4% of the average input genome size, suggesting that N. helminthoeca Oregon is not of the same genus as the other three Neorickettsia genomes. For the remainder of this paper, these genus reclassifications are used. Given these collective results, we recommend that the genus classification level can be defined as a group of genomes that together will yield a core genome alignment that represents ≥10% of the average input genome sizes.
Identifying a CGASI for species delineation.Within the Rickettsia, Orientia, Ehrlichia, Anaplasma, Neorickettsia, Wolbachia, Arcobacter, Caulobacter, Erwinia, Neisseria, Polaribacter, Ralstonia, and Thermus genome subsets, ANI, dDDH, and core genome alignment sequence identity (CGASI) values were calculated for 7,264 intragenus pairwise genome comparisons (Table S3), of which 601 are between members of the same species. The ANI and CGASI follow a second-degree polynomial relationship (r2 = 0.977) (Fig. 2A). Using this model, the ANI species cutoff of ≥95% is analogous to a CGASI cutoff of ≥96.8%. dDDH and CGASI follow a third-degree polynomial model (r2 = 0.978), with a dDDH of 70% being equivalent to a CGASI of 97.6%, indicating that the dDDH species cutoff is generally more stringent than the ANI species cutoff (Fig. 2B).
ANI, dDDH, and CGASI correlation analysis. CGASI, ANI, and dDDH values were calculated for 7,264 intragenus pairwise comparisons of genomes for Rickettsia, Orientia, Ehrlichia, Anaplasma, Neorickettsia, Wolbachia, Caulobacter, Erwinia, Neisseria, Polaribacter, Ralstonia, and Thermus. (A) The CGASI and ANI values for the intragenus comparisons follow a second-degree polynomial model (r2 = 0.977), with the ANI species cutoff of ≥95% being equivalent to a CGASI of 96.8%, indicated by the blue dashed box. (B) The CGASI and dDDH values for all pairwise comparisons follow a third-degree polynomial model (r2 = 0.978), with the dDDH species cutoff of ≥70% being equivalent to a CGASI of 97.6%, indicated by the red dashed box. (C) To identify the optimal CGASI cutoff to use when classifying species, for each increment of the CGASI species cutoff plotted on the x axis, the percentage of intraspecies and interspecies comparisons correctly assigned was determined based on classically defined species designations. The ideal cutoff should maximize the prediction of classically defined species for both interspecies and intraspecies comparisons. The ANI-equivalent CGASI species cutoff is represented by the blue dashed line, while the dDDH-equivalent CGASI species cutoff is represented by the red dashed line.
Table S3
Copyright © 2018 Chung et al.This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
The ideal CGASI threshold for species delineation would maximize the prediction of classically defined species, neither creating nor destroying the majority of the classically defined species. To examine this, all 7,264 intragenus pairwise comparisons were classified as either intraspecies or interspecies. Intraspecies comparisons are comparisons between genomes with the same classically defined species designation, while interspecies comparisons are between genomes within the same genus but with different classically defined species designations. Every possible CGASI threshold value was then tested for the ability to recapitulate these classically defined taxonomic classifications (Fig. 2C). In all cases, an abnormally high number of interspecies Rickettsia comparisons were found above both the established ANI and the dDDH species thresholds, consistent with previous observations that guidelines for establishing novel Rickettsia species are too lax (47), and as such they were excluded from this specific analysis. Below a CGASI of 97%, classically defined species begin to be separated, while organisms classically defined as different species begin to be collapsed. This coincides with the above-calculated ANI-equivalent threshold but differs from the above-calculated dDDH equivalent (Fig. 2C). The dDDH-equivalent CGASI threshold of 97.6% failed to predict the classically defined taxa from 100 intraspecies comparisons (Table S3) while the ANI-equivalent CGASI threshold of 96.8% failed to predict the classically defined taxa from 41 intraspecies comparisons (Table S3).
Given these results, we selected the ANI-equivalent CGASI value of ≥96.8% to further analyze these taxa. Overall, there were 41 pairwise comparisons of organisms across diverse taxa, including Arcobacter, Caulobacter, Neisseria, Orientia, Ralstonia, and Thermus, that are classically defined as the same species but would be classified as distinct species when assessed using our suggested CGASI threshold (Table S3). Additionally, there were 782 pairwise comparisons of organisms classically defined as different species that this threshold suggests should be the same species, of which only 10 were not in the genus Rickettsia, instead belonging to the Arcobacter and Caulobacter genera (Table S3).
Rickettsiaceae phylogenomic analyses.(i) Rickettsia. The Rickettsiaceae family includes two genera, the Rickettsia and the Orientia, and while both genera are obligate intracellular bacteria, Rickettsia genomes have undergone more reductive evolution, having a genome size ranging from 1.1 to 1.5 Mbp (48) compared to the 2.0- to 2.2-Mbp size of the Orientia genomes (49). Of the Rickettsiales, the Rickettsia genus contains the greatest number of sequenced genomes and named species, containing 69 genome assemblies in ≤100 contigs representing 27 unique species. Rickettsia genomes are currently classified based on the Fournier criteria, an MLST approach established in 2003 based on the sequence similarity of five conserved genes: the 16S rRNA gene, citrate synthase gene (gltA), and three surface-exposed protein antigen genes (ompA, ompB, and gene D) (50). To be considered a Rickettsia species, an isolate must have a sequence identity of ≥98.1% to the 16S rRNA and ≥86.5% to gltA in at least one preexisting Rickettsia species. Within the Rickettsia, using ompA, ompB, and gene D sequence identities, the Fournier criteria also support the further classification of Rickettsia species into three groups: the typhus group, the spotted fever group (SFG), and the ancestral group (50). However, the Fournier criteria have not yet been amended to classify the more recently established transitional Rickettsia group (51), indicating a need to update the Rickettsia taxonomic scheme.
A total of 69 Rickettsia genomes representative of 27 different established species were used for ANI, dDDH, and core genome alignment analyses. Regardless of the method used, a major reclassification is justified (Fig. 3; Table S2). A core genome alignment constructed using the 69 Rickettsia species genomes yielded a core genome alignment size of ∼0.56 Mbp, 42.4% of the average lengths of the input Rickettsia genomes (Table 1). For 42 of the 44 SFG Rickettsia genomes, the CGASI between any two genomes is ≥98.2%, well within the proposed CGASI species cutoff of ≥96.8% (Fig. 3), while the CGASI is ≤97.2% in the ancestral and typhus groups. If a CGASI cutoff of 96.8% is used to reclassify the Rickettsia species, all but two of the SFG Rickettsia genomes would be classified as the same species (Fig. 3), with the two remaining SFG Rickettsia genomes, Rickettsia monacensis IrR Munich and Rickettsia sp. strain Humboldt, being designated as the same species. This is consistent with ANI results as well, while dDDH yields conflicting results (Fig. 3). For the transitional group Rickettsia, Rickettsia akari and Rickettsia australis would be collapsed into a single species due to having CGASI values of 97.2% in a comparison with one another. Similarly, Rickettsia asembonensis, Rickettsia felis, and Rickettsia hoogstraalii, all classified as transitional group Rickettsia, would be collapsed into another species, all having CGASI values of 97.2% in comparisons with one another. This is consistent with a phylogenomic tree generated from the Rickettsia core genome alignments, where the SFG Rickettsia genomes have far less sequence divergence than the rest of the Rickettsia genomes (Fig. 3).
Analysis of the ANI, dDDH, and CGASI values of 69 Rickettsia genomes. Across 69 Rickettsia genomes, the ANI and dDDH values (A) and CGASI values (B) were calculated for each pairwise genome comparison. The shape next to each Rickettsia genome represents whether the genome originates from an ancestral (⚫), transitional (), typhus group (▲), or spotted fever group (■) Rickettsia species, while the colors of the shapes on the axes represent species designations as determined by a CGASI cutoff of ≥96.8%. (C) An ML phylogenomic tree with 1,000 bootstraps was generated using the core genome alignment. (D) The relationships in the green box in panel C cannot be adequately visualized at the necessary scale, so they are illustrated separately with a different scale. For both trees, red branches represent branches with <100 bootstrap support.
(ii) Orientia.The organisms within Orientia have no standardized criteria to define novel species. There are far fewer high-quality Orientia genomes than Rickettsia genomes, which is due partly to the large number of repeat elements found in Orientia genomes, with the genome of Orientia tsutsugamushi being the most highly repetitive sequenced bacterial genome to date, with ∼42% of its genome being comprised of short repetitive sequences and transposable elements (52).
The Orientia core genome alignment was constructed using three O. tsutsugamushi genomes and is 0.97 Mbp in size, ∼47.6% of the average input genome size, with CGASI values ranging from 96.3 to 97.2% (Fig. S3). Reclassifying the Orientia genomes using a CGASI species cutoff of 96.8% would result in O. tsutsugamushi Gillliam and O. tsutsugamushi Ikeda being classified as species separate from O. tsutsugamushi Boryong (Fig. S3). In this case, this reclassification would not be consistent with recommendations from using ANI or dDDH. We suspect that ANI is strongly influenced by the large number of repeats in the genome due to ANI calculations being based off the sequence identity of 1-kbp query genome fragments. In comparison, we do not anticipate that whole-genome alignments would be confounded by the repeats. While the LCBs may be fragmented by the repeats, creating smaller syntenic blocks, the nonphylogenetically informative repeats are eliminated from an LCB-based analysis.
FIG S3
Copyright © 2018 Chung et al.This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
Anaplasmataceae phylogenomic analyses.(i) Ehrlichia. Within the Anaplasmataceae, species designations are frequently assigned based on sequence similarity and clustering patterns from phylogenetic analyses generated using the sequences of genes such as 16S rRNA, groEL, and gltA (53). For example, the species designations for Ehrlichia khabarensis and Ehrlichia ornithorhynchi are justified based on having a lower sequence similarity for 16S rRNA, groEL, and gltA (53–55).
A core genome alignment constructed using 16 Ehrlichia genomes, representative of four defined species, yields a 0.49-Mbp alignment, which equates to 39.8% of the average Ehrlichia genome size (1.25 Mbp) (Table 1). Using a CGASI species cutoff of 96.8%, the Ehrlichia chaffeensis and Ehrlichia ruminantium genomes were recovered as monophyletic and well-supported species, which is consistent with ANI and dDDH results (Fig. 4). The two Ehrlichia muris and Ehrlichia sp. strain Wisconsin_h genomes have CGASI values of >97.8%, indicating that the three genomes represent one species, which is consistent with ANI but not dDDH results (Fig. 4). The genomes of Ehrlichia sp. strain HF and E. canis Jake do not have CGASI values of ≥96.8% with any other species, confirming their status as individual species, identical to the results found with ANI and dDDH (Fig. 4).
Analysis of the ANI, dDDH, and CGASI values of 16 Ehrlichia genomes. For 16 Ehrlichia genomes, the ANI and dDDH values (A) and CGASI values (B) were calculated for each pairwise genome comparison. The colors of the shapes next to each Ehrlichia genome represent species designations as determined by a CGASI cutoff of ≥96.8%. (C) An ML phylogenomic tree with 1,000 bootstraps was generated using the core genome, with red branches representing branches with <100 bootstrap support.
(ii) Anaplasma.A novel Anaplasma species is currently defined based on phylogenetic analyses involving 16S rRNA, gltA, and groEL, with a new species having a lower sequence identity and a divergent phylogenetic position relative to established Anaplasma species (56–59). A core genome alignment constructed using 30 Anaplasma genomes yielded a 20-kbp alignment, 1.4% of the average input genome size. As noted before, such low values are indicative of more than one genus being represented in the taxa that were included in the analysis. Thus, CGASI analyses for Anaplasma species were done on two Anaplasma genome subsets, one containing the 20 Anaplasma phagocytophilum genomes and the other containing 9 Anaplasma marginale genomes and 1 Anaplasma centrale genome (Table 1).
A 1.25-Mbp core genome alignment, consisting of 39.8% of the average input genome size, was constructed using the 20 A. phagocytophilum genomes. All 20 genomes have CGASI values of ≥96.8%, supporting their designation as members of a single species (Fig. 5). This is supported by ANI, but dDDH again yields conflicting results (Fig. 5). The core genome alignment generated using the remaining 10 Anaplasma genomes yields a 0.77-Mbp core genome alignment, 65.3% of the average input genome size. While the A. marginale genomes all have CGASI values of ≥96.8% in comparisons with one another, the Anaplasma centrale genome has CGASI values ranging from 90.7 to 91.0% compared to the nine A. marginale genomes, supporting A. centrale as a separate species from A. marginale, consistent with existing taxonomy and with the ANI and dDDH species cutoffs (Fig. 5).
Analysis of the ANI, dDDH, and CGASI values of 30 Anaplasma genomes. (A) For 30 Anaplasma genomes, the ANI and dDDH values were calculated for each genome comparison and color-coded to illustrate the results with respect to ANI cutoffs of ≥95% and dDDH cutoffs of ≥70%. (B) When we attempted to construct a core genome alignment using all 30 Anaplasma genomes, only a 20-kbp alignment was generated, accounting for <1% of the average Anaplasma genome size. Therefore, CGASI values were calculated after the Anaplasma genomes were split into two subsets containing 20 A. phagocytophilum genomes and the remaining 10 Anaplasma genomes. In all panels, the shape next to each genome denotes genus designations as determined by the size of their core genome alignments, while the color of the shape denotes the species as defined by a CGASI cutoff of ≥96.8%.
(iii) Neorickettsia.The Neorickettsia genus contains four genome assemblies: Neorickettsia helminthoeca Oregon, Neorickettsia risticii Illinois, Neorickettsia sennetsu Miyayama, and Neorickettsia sp. strain 179522. The genus was first established in 1954 with the discovery of N. helminthoeca (60). In 2001, N. risticii and N. sennetsu, both initially classified as Ehrlichia strains, were added to the Neorickettsia based on phylogenetic analyses of 16S rRNA and groESL (31). A core genome alignment constructed using all four Neorickettsia genomes yields a 20-kbp alignment, 2.3% of the average input genome size. When excluding the genome of N. helminthoeca Oregon, the three remaining Neorickettsia genomes form a core genome alignment of 0.76 Mbp in size, 87.4% of the average input genome size. The sequence identity of the core genome alignment indicates that N. risticii Illinois, N. sennetsu Miyayama, and Neorickettsia sp. 179522 are three distinct species within the same genus, while the length of the core genome alignment indicates that N. helminthoeca Oregon is of a separate genus (Fig. S4). When assessing the Neorickettsia species designations using ANI and dDDH cutoffs, the four Neorickettsia genomes can only be determined to be different species, as the two techniques are unable to delineate phylogenomic relationships at the genus level.
FIG S4
Copyright © 2018 Chung et al.This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
(iv) Wolbachia.The current Wolbachia classification system lacks traditional species designations and instead groups organisms by supergroup designations using an MLST system consisting of 450- to 500-bp internal fragments of five genes: gatB, coxA, hcpA, ftsZ, and fbpA (61). A core genome alignment generated using 23 Wolbachia genomes yields a 0.18-Mbp alignment, amounting to 14.8% of the average input genome size.
Among filarial Wolbachia supergroups C and D, the CGASI cutoff of 96.8% would split each of the traditionally recognized supergroups into two groups each, which is also supported by ANI and dDDH. While wOo and wOv would be the same species, wDi Pavia should be considered a different species. Similarly, wBm and wWb would be considered the same species, while wLs should be designated a separate species. The wCle, wFol, and wPpe endosymbionts from supergroups E, F, and L, respectively, would all be considered distinct species using CGASI, ANI, or dDDH.
The CGASI values between the supergroup A Wolbachia organisms, apart from wInc SM, have CGASI values of ≥96.8 (Fig. 6). The genome of wInc SM has CGASI values ranging from 94.6% to 95.9% compared to other supergroup A Wolbachia genomes, indicating that if species assignments were dependent solely on a CGASI cutoff, then wInc SM would be a different species. Because >10% of the positions in the wInc SM genome are ambiguous nucleotide positions, this leads to a large penalty in sequence identity scores, as seen with CGASI, ANI, and DDH (Fig. 6). However, a phylogenomic tree generated using the Wolbachia core genome alignment indicates that wInc SM is nested within the other supergroup A Wolbachia taxa, a clade with 100% bootstrap support (Fig. 6). Similarly, the results between CGASI, ANI, and dDDH in supergroup B are discordant. Five of the six supergroup B Wolbachia genomes, all except for wTpre, have CGASI values of ≥96.8% compared to one of the other five, indicating the five genomes are likely of the same species. However, despite wTpre being considered a different species if the CGASI cutoff of 96.8% is used, a phylogenomic tree constructed using the core genome alignment shows wTpre to be nested within the supergroup B Wolbachia organisms (Fig. 6).
Analysis of the ANI, dDDH, and CGASI values of 23 Wolbachia genomes. For 23 Wolbachia genomes, the ANI and dDDH values (A) and CGASI values (B) were calculated for each pairwise genome comparison. The shape next to each Wolbachia genome represents supergroup A (⚫), B (▲), C (■), D (), E (
), F (
), and L (◆) designations, while the color of the shape indicates species as defined by a CGASI cutoff of ≥96.8%. (C) An ML phylogenomic tree was generated using the Wolbachia core genome alignment constructed using 23 Wolbachia genomes, with red branches representing branches with <100 bootstrap support. The species designations in supergroup B show the CGASI-designated species clusters of wPip_Pel, wPip_JHB, wAus, and wStri to be polyphyletic.
In both cases, if wInc SM or wTpre were designated distinct species from the other supergroup A and B Wolbachia, respectively, paraphyletic clades would be created. This indicates the importance of using a phylogenomic analysis in addition to a threshold. In this case, the core genome alignment can easily be used with robust phylogenetic algorithms to generate both network trees and ML trees. For both cases, despite wInc SM and wTpre having <96.8% CGASI, the phylogenies reveal that these organisms should be considered the same species as the other supergroup A and B Wolbachia species, respectively.
Other taxa.The power of an approach that combines phylogenetics with a nucleotide identity threshold is similarly observed with Neisseria. A 0.33-Mbp core genome alignment was generated from 66 Neisseria genomes, accounting for 14.7% of the average input genome size. The core genome alignment largely supports the classically defined species, including Neisseria meningitidis, Neisseria gonorrhoeae, Neisseria weaveri, and Neisseria lactamica, although one N. lactamica isolate, N. lactamica 338.rep1_NLAC, appears to be inaccurately assigned (Fig. S5). The CGASI values suggest that each Neisseria elongata isolate is a distinct species, as are the Neisseria flavescens isolates. Like what was observed in Wolbachia species, a paraphyletic clade is observed when using a CGASI cutoff of 96.8%, with the genome of Neisseria sp. strain HMSC061B04 being nested within a clade of two Neisseria mucosa genomes, N. lactamica 338.rep1_NLAC, and several unnamed Neisseria taxa while not having a CGASI of ≥96.8% compared to any other Neisseria genome (Fig. S5).
FIG S5
Copyright © 2018 Chung et al.This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
DISCUSSION
Workflow for the taxonomic assignment of new genomes at the genus and species levels.These results highlight that while sequence identity cutoffs derived using nucleotide-based pairwise comparisons are important when delineating species, they should be coupled with an analysis of phylogenetic trees. Yet, phylogenetic trees alone are inadequate, as they cannot identify the level of branching where a species should be defined. However, the two results can be quite complementary when applied to core genome alignments and together make for a robust approach.
Given our results, we propose a workflow using three criteria to guide the taxonomic assignment of novel genomes at the genus and species levels (Fig. 7). Starting with a novel, sequenced query genome, the most closely related genus of the query genome should be identified using homology-based search algorithms (e.g., BLASTN searches of 16S rRNA). Once identified, a set of trusted genomes within the genus should be combined with this query genome to generate a core genome alignment using a tool such as Mauve (27) or Mugsy (28). These trusted genomes would be the type genomes, analogous to type strains. Unlike type strains, they are digitally recorded and thus highly unlikely to be lost. However, multiple organisms with different genome sequences could now serve in the capacity of type strains, eliminating the possibility that a single unusual type strain could unduly influence the taxonomy.
Workflow for the taxonomic assignment of a novel genome at the genus and species levels. The proposed workflow for assigning genus- and species-level designations using core genome alignments is based on three criteria: (i) the length of the core genome alignment, (ii) the sequence identity of the core genome alignment, and (iii) phylogenomic analyses. Using a query genome and a set of trusted genomes from a single genus, a core genome alignment is generated. The length of the core genome alignment is the first criterion that is used as the genus-level cutoff, with a core genome alignment size of ≥10% of the average input genome size indicating that all genomes within the subset are of the same genus. Provided that all genomes in the subset are of the same genus, the second criterion is the sequence identity of the core genome alignment, with genomes sharing ≥96.8% similarity being designated the same species. The third and final criterion uses a phylogenomic tree generated using the core genome alignment to check for paraphyletic clades. If the query genome has <96.8% core genome alignment sequence identity with any other genome and its designation as a new species does not form a paraphyletic clade, the genome should be considered a novel species.
From the core genome alignment, the genus assignment would be considered validated if the length of the core genome alignment is ≥10% of the average input genome size. Subsequently, a pairwise sequence identity matrix is calculated from the core genome alignment, identifying all genomes with ≥96.8% sequence identity. Finally, phylogenomic trees are generated with the core genome alignment and overlaid with the CGASI results. A query genome is deemed a novel species if it is <96.8% identical to the genomes of the type strains and the clades remain monophyletic. Importantly, whole-genome sequences are not needed to assign an existing genus and species designation to a new isolate, merely to describe new genera or species. Community-derived species-specific MLST schemes may be well suited for assigning isolates to already-named and -described species.
The genomics and data era may necessitate increased flexibility or a modification of the rules in bacterial nomenclature.The International Code of Nomenclature of Prokaryotes (62) is a frequently updated governance that includes 65 rules that regulate bacterial nomenclature, with the laudable goal of maintaining stability in bacterial names. These rules are designed to assess the correctness of bacterial names, as well as procedures for creating new names, but are agnostic to the methods or criteria used to delineate genera or species (62).
However, modern methods (i.e., genome sequencing, bioinformatics, data sharing, and isolate repositories) have the potential to profoundly and negatively affect bacterial nomenclature in ways where the current code may need to adjust and adapt. Historically, isolates were frequently discovered, described, and named by the same individual or group of individuals. Conflicts arose over time (e.g., two named species turned out to be the same species), and many of the rules focus on resolving those inevitable conflicts. One principle in the code for resolving conflicts relies on using the first published name. However, with modern methods, it would be fairly simple to conduct a comprehensive bioinformatics analysis and be the first to publish names of organisms, without any input from the relevant research communities, including individuals who have spent their scientific careers working on unnamed organisms. In addition, with extensive genomic data, there have been increased efforts to harmonize taxonomy that will lead to large changes in bacterial nomenclature. While choosing the first published name may be effective at resolving small conflicts, it may not be ideal for larger taxonomic changes that may be necessary and warranted. In these cases, the code may need to prioritize large collaborative community-supported efforts that include all relevant stakeholders to reform the nomenclature in a meaningful and important way, particularly efforts that reflect the recent changes in the ways that we think about bacteria and are reflective of the modern conceptual revolutions to which Stephen Jay Gould referred.
Furthermore, a complete genome sequence fully describes an organism, including its phenotypic potential. As such, a genome-informed taxonomic scheme, particularly one that relies on groups of genome type strains, as described here, may render the “Candidatus” designation obsolete. “Candidatus” was originally proposed as a category or rank given to a potential new taxon that has only sequence data and a morphology determined through microscopy with a nucleic acid probe (63). At that time, this precluded the naming of a species with only a single gene sequence, like 16S rRNA, which could be a chimera. While “Candidatus” had an important role at that time and prior to the widespread use of whole-genome sequencing, it has led to extensive confusion among researchers, particularly those new to fields with organisms that use this naming scheme. It also places an undue burden on researchers that study obligate host-associated organisms that are often very highly characterized but unamenable to growth on laboratory media and long-term storage in culture collections. Additionally, among endosymbionts and other obligate intracellular organisms, it has been applied inconsistently, compounding this confusion. Therefore, it seems reasonable that its use should be limited to only organisms from which a complete genome sequence cannot be obtained consistently from the same host organism or host cell line.
Limitations.A current issue of whole-genome aligners, such as Mauve or Mugsy, is the inability for the software to scale, being able to computationally handle only subsets of at most ∼80 genomes. While the aligner ParSNP in the Harvest suite can generate a core genome alignment from a larger subset of genomes, its use is limited to intraspecies comparisons (45), making it currently unamenable to this approach. However, for heavily sequenced genera, such as the Rickettsia genus, future species assignments for novel genomes do not require constructing a core genome alignment from every available genome. Instead, we recommend that for each genus, the relevant experts in the community establish and curate a set of trusted genomes with at least one representative of each named species that should be used for constructing core genome alignments. After an initial assessment, refinement could be made by using a core genome alignment with many more genomes from closely related species. As an example, in the case of Rickettsia species, the initial alignment would include representative genomes from each of the ancestral, typhus, spotted fever, and transitional groups. If the analysis indicated that the query was closest to typhus group genomes, a subsequent alignment would include a curated subset of genomes only from the typhus group. The first core genome alignment would serve to classify a genome into a subgroup, while the second would be used to assign a species designation.
Conclusions.As noted with the development of MLST (64), sequence data have the advantage of being incredibly standardized and portable. While MLST methods allow for insight at the subspecies level of taxonomic classifications and are heavily relied upon during infectious disease outbreaks, they do not provide sufficient resolution to define species. Increased resolution is required for taxonomy and can be obtained through the construction of whole-genome alignments that maximize the number of evolutionarily informative positions.
By generating core genome alignments for different genera, we sought to identify and develop a universal, high-resolution method for species classification. Using the core genome alignment, a set of genomes can be assessed based on the length, sequence identity, and a phylogenomic tree of the same core genome alignment. The length of the core genome alignment, which reflects the ability of the genomes to be aligned, is informative in delineating genera. Meanwhile, the pairwise comparisons of core genome alignment sequence identity can be used to delineate species, which can in turn be further refined with a phylogenetic analysis to resolve paraphyletic clades in the taxonomy.
Through this work, we have identified criteria that reconstruct classical species definitions using a method that is transparent and portable. In the course of this work, we have identified modifications that need to be made to the species and genus designations of a number of organisms, particularly within the Rickettsiales. While we have identified these instances, we recommend that any changes in nomenclature be addressed by collaborative teams of experts in the respective communities.
MATERIALS AND METHODS
Synteny plots.For the 10 complete Wolbachia genomes, NCBI BLAST v2.2.26 with the blastn algorithm was used to obtain the coordinates of the links for each of the pairwise genome comparisons. Synteny plots were generated and visualized using the Artemis Comparative Tool viewer (65, 66).
Core genome alignments.The exact commands for generating a core genome alignment from a directory containing all of the fasta files are provided in a bash script (see Text S1 in the supplemental material); the local paths for the bin directors for Mugsy and mothur need to be specified in the script. More specifically, genomes used in the taxonomic analyses were downloaded from NCBI’s GenBank (67). OrthoANIu v1.2 (68) and USEARCH v6.1.544 (69) were run with default settings and used for ANI calculations. GGDC v2.1 (ggdc.dsmz.de) (25) paired with the recommended BLAST+ alignment tool (70) was used to calculate dDDH values. For analyses in this paper, all dDDH calculations were performed using dDDH formula 2 due to the usage of draft genomes in taxonomic analyses (24). For each of the genome subsets, core genome alignments were generated using Mugsy v1.2 (28), with default settings, or Mauve v2.4 (28), with the progressive Mauve algorithm, retaining only LCBs of >30 bp in length that are present in all genomes. mothur v1.22 was used to process each of the core genome alignments to remove positions not present in all organisms in the genome subset (71). Sequence identity matrices for the core genome alignments were created using BioEdit v7.2.5 (brownlab.mbio.ncsu.edu/JWB/papers/1999Hall1.pdf). ML phylogenomic trees with 1,000 bootstraps were calculated for each core genome alignment using IQ-TREE v1.6.2 (72) paired with ModelFinder (73) to select the best model of evolution and UFBoot2 (74) for fast bootstrap approximation. Trees were visualized and annotated using iTOL v4.1.1 (75). Construction of neighbor-network trees was done using the R packages ape (81) and phangorn (77).
Text S1
Copyright © 2018 Chung et al.This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
Core protein alignments.Orthologs between complete genomes of the same species were determined using FastOrtho, a reimplementation of OrthoMCL (78) that identifies orthologs using all-by-all BLAST searches, run with default settings. The amino acid sequences of proteins present in all organisms in only one copy were aligned with MAAFT v7.313 FFT-NS-2, using default settings (79). For every protein alignment, the best model of evolution was identified using ModelFinder (73), and phylogenomic trees were constructed using an edge-proportional partition model (80) with IQ-TREE v1.6.2 (72) and UFBoot2 (74) for fast bootstrap approximation. PhyloPhlAn v0.99 was run using default settings as an alternative method of assessing protein-based phylogenies (39). Comparative analyses of core protein and core nucleotide alignment trees were performed using the R packages ape (76) and phangorn (77).
ACKNOWLEDGMENTS
We thank Joseph J. Gillespie for helpful discussions and encouragement.
This work was funded by the National Institute of Allergy and Infectious Diseases (grant U19AI110820).
FOOTNOTES
- Received October 9, 2018.
- Accepted October 11, 2018.
- Copyright © 2018 Chung et al.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license.
REFERENCES
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.↵
- 9.↵
- 10.↵
- 11.↵
- 12.↵
- 13.↵
- 14.↵
- 15.↵
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.↵
- 26.↵
- 27.↵
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵