Machine Learning Leveraging Genomes from Metagenomes Identifies Influential Antibiotic Resistance Genes in the Infant Gut Microbiome

The process of reconstructing genomes from environmental sequence data (genome-resolved metagenomics) allows unique insight into microbial systems. We apply this technique to investigate how the antibiotic resistance genes of bacteria affect their ability to flourish in the gut under various conditions. Our analysis reveals that strain-level selection in formula-fed infants drives enrichment of beta-lactamase genes in the gut resistome. Using genomes from metagenomes, we built a machine learning model to predict how organisms in the gut microbial community respond to perturbation by antibiotics. This may eventually have clinical applications.

resistance genes (3) and may be involved in the spread of resistance genes to pathogens (4)(5)(6). Additionally, antibiotics are often prescribed to treat infections without considering how the drug will affect the gut microbial community, which can lead to negative consequences for the human host (7). It is therefore important to study how the antibiotic resistance genes harbored by organisms in the gut microbiome impact community dynamics.
The preterm infant gut resistome is considered a research priority because premature infants are almost universally administered antibiotics during the first week of life (8). Early life is a critically important time for community establishment (9), and neonatal antibiotic therapies have both transient and persistent effects on the gut microbial community. Included among the many ways in which antibiotics have been shown to affect the microbiome are lower bacterial diversity (10), enrichment of Enterobacteriaceae (10,11), reduction of Bifidobacterium spp. (12), and enrichment of antibiotic resistance genes (13), including those that have no known activity against the particular antibiotic administered (14). Previous studies have shown that the community composition of the infant microbiome is affected by diet, with artificial formula selecting for Escherichia coli and Clostridium difficile (15) and breast milk selecting for certain strains of Bifidobacterium (16). Hypotheses regarding the effect of birth mode on the microbiome are contested, with most studies finding that it has an effect on the gut microbiome (17)(18)(19) whereas some show no effect (20,21). Gender (22) and maternal antibiotics administered before or during birth (23)(24)(25) also influence microbiome assembly.
Here we used genome-resolved metagenomics coupled with statistical and machine learning approaches to investigate the gut resistome of 107 longitudinally sampled premature infants. We show that certain antibiotic resistance genes in particular genomes affect how clinical factors influence the gut microbiome and, in turn, how the antibiotic resistance capabilities of a gut organism influence its growth and relative abundance.

RESULTS AND DISCUSSION
Antibiotic resistance of the premature infant microbiome. A total of 107 premature infants were studied during the first 3 months of life. The median birth weight was 1,228 g (interquartile range [IQR] ϭ 902 to 1,462), with 35% of the infants having extremely low (Ͻ1,000 g) birth weight and 65% of infants having birth weight of Ͼ1,000 g (see Table S1 and Text S1 in the supplemental material). Birth weight is closely linked to gestational age, which is divided into the following categories: late preterm (34-week to Ͻ37-week gestation), moderate preterm (32-week to Ͻ34-week gestation), very preterm (28-week to Ͻ32-week gestation), and extremely preterm (Ͻ28 weeks gestation) (26). Among the infants in this study, 30% were extremely preterm; such infants tend to have significant health problems, including higher rates of necrotizing enterocolitis and extreme dysbiosis of the microbiota (27). The majority (60%) of the infants in our study were classified as very preterm, just 10% of our infants were classified as moderate preterm, and no infants were classified as late preterm (Table S1 and Text S1). Because the infants in this study were mostly very or extremely preterm, it should be noted that the biological characteristics reported here are highly divergent from those of typical full-term infants (28).
Longitudinal sampling of each infant resulted in a total of 902 samples that were sequenced and analyzed. All 107 infants received gentamicin and ampicillin during the first week of life, and 36 of those infants received additional antibiotics in the later weeks due to disease (Table 1). Data on the types of antibiotics given to the infants, along with the day of life (DOL) on which they were administered, are available in Table S2. All samples were subjected to Illumina short-read shotgun sequencing, and the sequence data were assembled using idba-ud (see Materials and Methods for details). Binning resulted in a dereplicated set of 1,483 genomes (Table S3). The taxonomic composition of these samples is typical for the premature infant gut (Fig. 1A and B). Resfams (29) annotations of predicted amino acid sequences from the resulting scaffolds (Table S4) revealed that the most abundant resistance mechanisms were resistance-nodulation-cell division (RND) efflux pumps and ATP-binding-cassette (ABC) transporters ( Fig. 1C and D). Note that, in addition to their ability to contribute to antibiotic resistance, efflux pumps and transporters have been associated with stress response (30)(31)(32) and may reflect a rapidly changing environment during the first few months of life. For the infants that did not receive additional antibiotics (Fig. 1C), a decreasing trend in total antibiotic resistance potential was observed over time (P Ͻ 0.005) (Text S1). During the first week of life, empirical antibiotic therapy perturbs the microbiome by preferentially enriching for antibiotic-resistant organisms. This is consistent with prior results showing temporarily elevated resistance gene levels after administration of antibiotics (17). Microbial community recovery begins following this period. For the infants that received antibiotics after the first week of life (Fig. 1D), there was no consistent trend of decreasing resistance potential (Text S1). This suggests that administration of antibiotics to premature infants after the first week of life can prolong the enrichment of the resistome.
Approximately 20% of the resistance genes annotated by Resfams (Table S4) were not assignable to specific organisms in the microbiome (Table S5 and Text S1). This was partly due to some genes being carried on plasmids, which were excluded from the genomic analysis.
Formula feeding influences the gut resistome through strain-level selection. Permutational multivariate analysis of variance (PERMANOVA) tests, which discern and isolate the effects of factors through partitioning of variance (33), were performed to investigate the effect of feeding regimen, delivery mode, gender, maternal antibiotics, and the infant's current antibiotic therapy on the resistome (Text S1). Tests were performed on the resistomes of samples taken at weeks 2, 4, and 6 to avoid the bias of repeated measures in longitudinal sampling. At week 2, the distribution of antibiotic resistance genes seen with formula-fed infants was not significantly different from that seen with infants that received breast milk. However, a difference was detected at weeks 4 and 6 (P Ͻ 0.05), accompanied by an increase in effect size as assessed by PERMANOVA F-statistic (Table S6). This signals divergence of the resistomes of formulafed and breast-fed infants over time. The PERMANOVA tests were not sensitive enough to detect any effects on the resistome resulting from delivery mode, gender, or antibiotics, which may have been because the test displays conservatism when variances are positively related to group sample size (34). Because these factors have been shown to alter the microbiota (19,22,24,25), it is unlikely that the resistome was truly unchanged. Since feeding type was the only factor that produced a detectable response, we further investigated its effects.
Random forest models were used to classify resistomes as belonging to either a formula-fed baby or a breast-fed baby, and we used the feature importance scores of the trained models to select resistance genes for further study ( Table 2). One of the four selected resistance genes was significantly associated with a feeding group: class D beta-lactamase was enriched in formula-fed infants (P Ͻ 0.05) ( Fig. 2A). Genomeresolved analysis (Table S5 and S7) revealed that class D beta-lactamase genes are most frequently carried by Clostridium difficile (Text S1). Among the 67 C. difficile genomes in the dereplicated data set, 38 harbor a class D beta-lactamase gene. Phylogenetic analysis reveals that these 38 organisms are very closely related (Fig. 2B). To ascertain if this C. difficile strain is involved in the enrichment of class D beta-lactamase in the formula-fed infant gut resistome, the relative abundances of C. difficile with and without a class D beta-lactamase gene in the gut microbiome of breast-fed and formula-fed infants were assessed. C. difficile with a class D beta-lactamase gene was consistently more abundant than C. difficile lacking this gene among the infants that received only formula, while both types of C. difficile were low in relative abundance among the infants that received breast milk (Fig. 2C). Even with the lower relative abundance of some C. difficile strains, there was no significant difference in genome completeness and N50 between the two groups (Table S3 and Text S1), assuring us that there was no methodological issue that reduced our ability to detect beta-lactamase. Prior studies have reported an increased abundance of C. difficile in the gut microbiomes of formula-fed infants (15), but here we reveal that formula feeding enriches for a particular C. difficile strain. Class D beta-lactamase hydrolyzes beta-lactam antibiotics (35), and there is no known connection between host diet and its antibiotic resistance function. It is thus unlikely that class D beta-lactamase offers a selective advantage to organisms in the gut of formula-fed infants, but this gene may be linked to other genes that confer an advantage. Pairwise correlations of the Resfams and KEGG metabolism modules in C. difficile genomes revealed that one KEGG module, the cytidine 5=-monophospho-3deoxy-D-manno-2-octulosonic acid (CMP-KDO) biosynthesis module, was perfectly correlated with the presence of the class D beta-lactamase gene. CMP-KDO catalyzes a key reaction in lipopolysaccharide biosynthesis (36). Further inspection of the KEGG annotations revealed that only one gene from this module was present in C. difficile: the arabinose-5-phosphate isomerase gene. This gene typically occurs in Gram-negative bacteria, where it plays a role in synthesis of lipopolysaccharide for the outer membrane (37), and yet a recent study identified arabinose-5-phosphate isomerase in a Gram-positive organism, Clostridium tetani (38). Although the function of this gene in Gram-positive bacteria is unknown, it is hypothesized to be a regulator and may modulate carbohydrate transport and metabolism (38). If so, C. difficile (Gram-positive) strains with arabinose-5-phosphate isomerase may have a competitive advantage because they are able to rapidly respond to the availability of the carbohydrates that are abundant in formula. It is also possible that other, potentially unknown genes are responsible for the observed effect and that these genes may not necessarily relate to metabolism of compounds in formula. Breast-fed babies have an increased abundance of Bifidobacterium (Text S1) (16), so the ways in which different strains of C. difficile interact and compete with Bifidobacterium may contribute to the observed trend.
Major facilitator superfamily (MFS) pumps are associated with increased replication. A previous analysis revealed that antibiotic administration is associated with elevated bacterial replication index (iRep) values, which was hypothesized to be due to high resource availability after elimination of antibiotic-susceptible strains (39). Expanding upon this result, we show here that the mean iRep value for a sample in the days following antibiotic treatment is positively correlated with total resistance gene content (P Ͻ 0.05) (Fig. 3A). To be present in the period following antibiotic administration, all organisms must be antibiotic resistant; it is thus unclear why a larger inventory of resistance genes should lead to higher growth rates.
To characterize the effect of antibiotic resistance genes on iRep values in isolation from the confounding effects of antibiotics, we studied infants that did not receive any antibiotics after the first week of life. In these infants, organisms carrying genes for MFS transporters had significantly higher iRep values than those that do not have MFS genes (P Ͻ 5 ϫ 10 Ϫ5 ) (Fig. 3B). As there are known differences in median iRep values among phyla (39), the comparison was repeated within each phylum that contained members with and without MFS genes. The trend of higher iRep values for organisms with MFS was most apparent in Firmicutes (P Ͻ 5 ϫ 10 Ϫ4 ) (Fig. 3B). The genomes lacking MFS show comparatively high completeness scores (Table S3 and Text S1), suggesting that this finding is not due to missed detection of the MFS genes. Therefore, the presence of these antibiotic resistance genes appears to inherently increase replication, even when no antibiotics are being administered. This could be due to protection from antibiotics being produced at a low level by other gut organisms (40) or a result of naturally beneficial physiological functions of MFS pumps (41). We also acknowledge that this finding may simply reflect a high incidence of organisms with MFS genes present during periods of fast replication without a causal link.
A model that predicts an organism's response to vancomycin and cephalosporins. We modeled the relationship between the gene content of a gut organism and its direction of change in relative abundance (increase versus decrease) after a premature infant is administered a combination of a glycopeptide antibiotic (vancomycin) and a beta-lactam antibiotic (cephalosporin [either cefotaxime or cefazolin]). Principalcomponent analysis (PCA) was performed on Resfams (29) and KEGG (42) annotations to generate a low-dimensional representation of each organism's metabolic potential and resistance potential. The first five principal components (PCs) cumulatively ex- plained 48% of the variation in the data set. Using these PCs as input, the AdaBoost-SAMME algorithm (43) was applied, with decision tree classifiers as base estimators. The model, trained on 70% of the data, performed extremely well on the validation set, with a precision value of 1.0 and a recall value of 1.0, indicating that every genome was correctly classified. Because the validation set was utilized for testing during the preliminary stages of model development, the model was also evaluated with a final test set, with which it achieved 0.9 precision and 0.7 recall (Text S1).
Of the features that acted as the strongest contributors to each of the PCs, five genes with a tendency to occur in microbes that increase in relative abundance after antibiotic treatment were identified (Fig. 4). One of these is the subclass B2 beta-lactamase gene, which is carried by several of the organisms that persisted after antibiotics, including Enterococcus faecalis, Clostridium baratii, and Bradyrhizobium sp. (Table S5 and S7 and Text S1). Subclass B2 beta-lactamase has been shown to hydrolyze carbapenems and to display much lower levels of resistance to cephalosporins (44). Considering its substrate specificity for carbapenems, this betalactamase may not directly contribute to an organism's ability to persist after treatment with cephalosporins; rather, it may be linked to other, potentially unknown genes. However, the substrate specificity of an antibiotic resistance gene can depend on the organismal context of that gene (45), and a single base substitution in a beta-lactamase gene can alter substrate specificity (46), so the possibility that beta-lactamases falling into the B2 subclass may confer resistance to cephalosporins in some gut organisms should not be discounted. Furthermore, our model shows that a gene linked to vancomycin resistance, vanR, is among the genes predictive of an organism's propensity to increase in relative abundance after antibiotic treatment (Fig. 4). VanR is the transcriptional activator of an operon harboring genes involved in peptidoglycan modification (VanH, VanA, and VanX), which prevents vancomycin from binding to its target (47). This gene cluster usually resides on plasmids (48,49). VanR, encoded by a gene that is essential for the initiation of the activity of the vancomycin resistance operon promoter (50), was chromosomally present in several genomes of organisms that increased in abundance after treatment with antibiotics, such as Enterococcus faecalis and Clostridium perfringens (Table S5 and S7 and Text S1). Because our genomic analysis precluded the assignment of genes on plasmids, VanR was the best indicator of resistance.
In addition to genes specifically encoding resistance to beta-lactams or glycopeptides, efflux pumps and transporters were also strong contributors to the PCs used as The tendency of genes to occur in the class of genomes that increased in relative abundance after antibiotics. Genes and modules strongly contributing to the principal components used in the machine learning model were identified, and class tendency was calculated using the ratio of the gene's prevalence in the increased-abundance group to its prevalence in the decreased-abundance group. Genes associated with the increased-abundance class of genomes are colored red. input to the model. Mex genes (of the resistance-nodulation-cell division family of drug efflux pumps) and ATP-binding cassette (ABC) transporter genes were associated with microbes that increase in relative abundance after treatment with antibiotics (Fig. 4). Multidrug efflux pumps are essential for the intrinsic drug resistance of many bacteria, and overexpression of the genes for these pumps leads to elevated resistance levels (51). Bacteroides ovatus and Bacteroides helcogenes carried multiple copies of Mex efflux pumps, while Enterococcus faecalis and Clostridium baratii harbored several ABC transporter genes (Table S5 and S7 and Text S1). Although the genomes of these organisms also encoded target-specific resistance genes such as the subclass B2 beta-lactamase, the more general pumps and transporters likely enhanced their ability to flourish after antibiotic treatment.
Previous studies have utilized data from 16S rRNA gene amplicon sequencing or read-based metagenomics of the human microbiome to predict life events and disease states of the human host using machine learning or other modeling techniques (52,53). However, read-based metagenomics lacks resolution at the genomic level, and, due to strain-level differences in antibiotic resistance (54), taxonomy data from marker gene studies cannot be used to predict how particular organisms in a community will respond to antibiotics. Here, for the first time, we utilized the data obtained by reconstructing genomes from metagenomes to make predictions about the future states of individual gut microbes. This has tremendous potential for application in the fields of medicine and microbial ecology. For example, such a model can be used before administering drugs to a patient to verify that a particular combination of antibiotics will not lead to overgrowth of an undesirable microbe. Our report serves as a proof of concept for this application of machine learning used in conjunction with genome-resolved metagenomics to derive biological insight.

MATERIALS AND METHODS
Sample collection, sequencing, assembly, and gene prediction. Fecal samples were collected from 107 infants that resided in the neonatal intensive care unit (NICU) at the Magee-Women's Hospital in Pittsburgh, PA, during the sampling period. Briefly, DNA was extracted using a PowerSoil DNA isolation kit (Mo Bio Laboratories, Carlsbad, CA) and sequenced using an Illumina HiSeq platform. Details on sample recovery, extraction, library preparation, and sequencing have been previously reported (55)(56)(57). Using default parameters for all the programs, the reads were trimmed with Sickle (https://github.com/ najoshi/sickle), cleared of human contamination following mapping to the human genome with Bowtie2 (58), and assembled with idba_ud (59). Additionally, idba_ud was used to generate coassemblies for each infant by simultaneously assembling all the samples belonging to the infant. Prodigal (60) run in the metagenomic mode was used for gene prediction.
Genome recovery and calculation of relative abundances. For each infant, reads from all samples from that infant were mapped to all individual assemblies from that infant as well as to the coassembly for the infant using SNAP (61). Coverage of scaffolds was calculated and used to run concoct (62) with default parameters on all individual assemblies and coassemblies. To remove redundant bins, all bins recovered from each infant were dereplicated using dRep (63) v0.4.0 with the following command: dRep dereplicate_wf--S_algorithm gANI -comp 50 -con 25 -str 25 -l 50000 -pa 0.9 -nc.1.
Using Bowtie2 (58), the reads from each sample were mapped to the set of genomes that were recovered from that particular infant. The read mapping output files were used to calculate the average coverage of each genome in each sample, and the coverage values were converted to relative abundance values by utilizing the read length, total number of reads in the sample, and genome length.
iRep calculation. For each sample, a set of representative genomes was first chosen from the complete collection of dereplicated genomes. First, all genomes were clustered at 98% average nucleotide identity (ANI) using dRep (63). A pangenome was then generated for each of these clusters using PanSeq (64), creating a list of fragments representing the entire sequence space of each cluster. All pangenomes of all clusters were merged, and reads from all samples were mapped to the resulting pangenome set using SNAP (61). By analyzing the coverage of all fragments in the pangenome set, the breadth of each genome in each sample was calculated (number of genome fragments Ͼ1 ϫ coverage/ total genome fragments). Genomes with less than 85% breadth were removed from analysis. For all remaining genomes, the genome from each cluster with the highest breadth was added to that sample's representative genome list.
Next, reads from each sample were mapped to its representative genome list using bowtie2 (58) default parameters. iRep (39) was run on the resulting mapping files using default parameters and without GC correction. Only values that passed iRep's default filtering and were Ͻ3 were considered for analysis.
Annotation. The amino acid sequences of genes predicted by the metaProdigal gene finding algorithm (60) were searched against Resfams (29), an antibiotic resistance gene-specific profile hidden Markov model (HMM) database, using the hmmscan function of HMMER v 3.1b2 (65). The --cut_ga option was used to set the reporting and inclusion limits as the profile-specific gathering threshold, and those limits have been manually optimized on a profile-by-profile basis to ensure Resfams prediction accuracy (29). The Resfams annotation output and the coverage of each scaffold that had a hit to a Resfams profile were used to generate sample resistance gene summaries. Each sample resistance gene summary (see Table S8 in the supplemental material), which represents the antibiotic resistance potential of a particular infant gut microbiome at a particular point in time, displays the counts per million reads (CPM) for each of the 170 antibiotic resistance gene families in the Resfams database. Additionally, genome resistance gene profiles that indicated the count of each resistance gene were developed for each genome (Table S5). Information about the database, including descriptions of the antibiotic resistance genes represented by each accession code, is available at http://www.dantaslab.org/resfams/.
To gather general metabolism data, all binned sequences were searched against the Kyoto Encyclopedia of Genes and Genomes (KEGG) (42) HMMs and the results were parsed for genome profiling. This resulted in a KEGG metabolism profile for each organism that displayed the fraction of each KEGG module encoded by that genome (Table S9).
Statistical and computational analysis. To evaluate the effect of feeding regimen, delivery mode, gender, maternal antibiotics, and the infant's current antibiotic therapy, three cross-sectional PERMANOVA (66) tests were performed for weeks 2, 4, and 6 using the adonis2 function of the vegan package in R (67). For each infant, the first sample of each week was identified and the resistance gene summary of that sample was included in the PERMANOVA. If antibiotics were being administered on the day of sampling (which also indicates a current disease diagnosis), the infant was labeled as currently receiving antibiotics. Infants that were exclusively fed breast milk and infants that were given breast milk at any point were both classified as receiving breast milk. The Bray-Curtis dissimilarity metric was used and 9,999 permutations were performed to assess the marginal effects of the terms. The factor revealed to correspond to significant differences in antibiotic resistance gene content (P Ͻ 0.05) was selected for continued analysis. To identify antibiotic resistance genes associated with either formula feeding or breast milk feeding during the weeks indicated by the PERMANOVA results, the infant's diet was used to classify sample resistance gene summaries using random forest models (68). Mann-Whitney U tests were performed on Resfams genes that had feature importance scores above 0.07 in the random forest model, as calculated by the Gini importance metric. Genomes containing resistance genes significantly associated with a particular feeding type, along with genomes of the same species lacking these genes, were further investigated. The ribosomal protein S3 (RPS3) genes for each genome were identified using rp16.py (https://github.com/christophertbrown/bioscripts/blob/master/bin/rp16.py). The RPS3 nucleotide sequences were aligned with MUSCLE (69) using default parameters, and a maximum-likelihood phylogenetic tree was built with RAxML (70). Pairwise Pearson correlations of Resfams data with KEGG modules within these genomes were calculated.
The Pearson correlation of the mean replication index (iRep) value for a sample and the sample's total resistance gene content was determined for samples collected within 5 days following antibiotic treatment. The rates of replication of organisms harboring antibiotic resistance genes were compared to those of organisms lacking resistance genes of the same category. All P values were subjected to Bonferroni correction for multiple testing.
Infants from whom a sample was taken both before and after a post-week antibiotic treatment were identified, and the before-treatment and after-treatment samples were selected (no samples were available from the period prior to the time at which the empirical antibiotics were administered during the first week). Genomes from the selected samples were identified and labeled as either increasing or decreasing in relative abundance from the preantibiotic sample to the postantibiotic sample. Using scikit-learn (68), development of a machine learning model to predict the direction of change in relative abundance for each genome based on its Resfams and KEGG metabolism data was attempted, and yet an adequate model could not be developed, presumably due to variations in the ways in which organisms respond to different antibiotic combinations. Therefore, the data set was narrowed to include the six infants that received either cefotaxime or cefazolin (both cephalosporin antibiotics) in conjunction with vancomycin. Seventy percent of the genomes obtained from these infant samples were used for training, 15% were used as a validation set for model improvement, and 15% were held out as a final test set. Several attempts to improve model performance through algorithm choice, feature engineering, and parameter tuning were applied, and the model that exhibited the best results with regard to precision and recall was selected. This model was then used to make predictions for the final test set. Each feature constructed for the model was a principal component of the Resfams and KEGG metabolic data, and the genes/modules contributing most strongly to each of these principal components were identified. The tendency of each of the genes and modules to occur in the increased-abundance class was calculated by adding Ϫ1 to the gene's mean value in the increased-abundance class divided by its mean value in the decreased-abundance class.
Data availability. The dataset used was comprised of 597 previously reported samples (55-57) and 305 new samples. These samples are available at NCBI under accession number SRP114966. The code for the analysis, along with all the data and metadata used in the analysis, is hosted at https://github.com/ SumayahR/antibiotic-resistance.

ACKNOWLEDGMENTS
The study was approved by the University of Pittsburgh Institutional Review Board (IRB) (protocol PRO12100487).
M.J.M. and J.F.B. conceived of the project. M.R.O. processed the sequence data, including binning and dereplication of genomes. S.F.R. performed the computational analysis and interpreted the results. S.F.R. and J.F.B. wrote the manuscript, and all of us contributed to the manuscript revisions.
Funding was provided through the National Institutes of Health (NIH) under grant RAI092531A and the Alfred P. Sloan Foundation under grant APSF-2012-10-05. This work used the Vincent J. Coates Genomics Sequencing Laboratory, supported by NIH Instrumentation Grant S10 OD018174.
We acknowledge Robyn Baker for recruiting infants and Brian Firek for performing DNA extractions. We also thank David Burstein for the KEGG HMM annotation pipeline and Christopher Brown for scripts to identify ribosomal proteins and to calculate genome coverage.