16S rRNA Amplicon Sequencing for Epidemiological Surveys of Bacteria in Wildlife

Several recent public health crises have shown that the surveillance of zoonotic agents in wildlife is important to prevent pandemic risks. High-throughput sequencing (HTS) technologies are potentially useful for this surveillance, but rigorous experimental processes are required for the use of these effective tools in such epidemiological contexts. In particular, HTS introduces biases into the raw data set that might lead to incorrect interpretations. We describe here a procedure for cleaning data before estimating reliable biological parameters, such as positivity, prevalence, and coinfection, using 16S rRNA amplicon sequencing on an Illumina MiSeq platform. This procedure, applied to 711 rodents collected in West Africa, detected several zoonotic bacterial species, including some at high prevalence, despite their never before having been reported for West Africa. In the future, this approach could be adapted for the monitoring of other microbes such as protists, fungi, and even viruses.

their never before having been reported for West Africa. In the future, this approach could be adapted for the monitoring of other microbes such as protists, fungi, and even viruses. KEYWORDS: bacteria, emerging infectious diseases, high-throughput sequencing, metabarcoding, molecular epidemiology, next-generation sequencing, rodents, West Africa, zoonoses P athogen monitoring in wildlife is a key method for preventing the emergence of infectious diseases in humans and domestic animals. More than half the pathogens causing disease in humans originate from animal species (1). The early identification of zoonotic agents in animal populations is therefore of considerable interest for human health. Wildlife species may also act as a reservoir for pathogens capable of infecting livestock, with significant economic consequences (2). The monitoring of emerging diseases in natural populations is also important for preserving biodiversity, because pathogens carried by invasive species may cause the decline of endemic species (3). There is, therefore, a need to develop screening tools for identifying a broad range of pathogens in samples consisting of large numbers of individual hosts or vectors.
High-throughput sequencing (HTS) approaches require no prior assumptions about the bacterial communities present in samples that are diverse in nature, including noncultivable bacteria. Such HTS microbial identification approaches are based on the sequencing of all (WGS [whole-genome sequencing]) or some (RNA-seq [whole-RNA sequencing] or 16S rRNA amplicon sequencing) of the bacterial DNA or RNA in a sample, followed by comparison to a reference sequence database (4). HTS has made major contributions to the generation of comprehensive inventories of the bacterial species, including pathogens, present in humans (5). Such approaches are now being extended to the characterization of bacteria in wildlife (6)(7)(8)(9)(10)(11)(12)(13). However, improvements in the estimation of infection risks will require more than just the detection of bacterial pathogens. Indeed, we will also need to estimate the prevalence of these pathogens by host taxon and/or environmental features, together with coinfection rates (14,15) and pathogen interactions (16,17). Razzauti et al. (8) recently used 16S rRNA amplicon sequencing with the dual-index sequencing strategy of Kozich et al. (18) to detect bacterial pathogens in very large numbers (up to several hundred samples in a single run) of rodent samples by the use of an Illumina MiSeq sequencing platform. The 16S rRNA amplicon sequencing technique is based on the amplification of small fragments of one or two hypervariable regions of the 16S rRNA gene. The sequences of these fragments are then obtained and compared with reference sequences in curated databases for taxonomic identification (4,19). Multiplexed approaches of this kind include short indices (or tags) that are linked to the PCR products and specific to a given sample. This makes it possible to assign the sequences generated by the HTS run to a particular sample following bioinformatic analysis of the data set (18). Razzauti et al. (8) demonstrated the considerable potential of this approach for determining the prevalence of bacteria within populations and for analyzing bacterial interactions within hosts and vectors, based on the accurate characterization of bacterial diversity within each of the individual samples that it provides. However, various sources of error during the generation and processing of HTS data (20) may make it difficult to determine which samples are really positive or negative for a given bacterium. The detection of one or a few sequences assigned to a given taxon in a sample does not necessarily mean that the bacterium is actually present in that sample. We carried out an extensive literature review, from which we identified several potential sources of error involving all stages of a 16S rRNA amplicon sequencing experiment-from the collection of samples to the bioinformatic analysisthat might lead to false-negative or false-positive screening results (Table 1) (18,19,(21)(22)(23)(24)(25)(26)(27)(28)(29)(30)(31)(32)(33)(34)(35)(36)(37)(38)(39)(40). These error sources have now been documented, and recent initiatives have called for the promotion of open sharing of standard operating procedures (SOP) and best practices in microbiome research (21). However, no experimental designs mini-TABLE 1 Sources of bias during the experimental and bioinformatic steps of 16S rRNA amplicon sequencing: consequences for data interpretation and solutions for mitigating these biases

Experimental step(s) Source(s) of errors Consequence(s) Solution(s) Sample collection
Cross-contamination between individuals (21) False-positive samples Rigorous processing (decontamination of the instruments, cleaning of the autopsy table, use of sterile bacterium-free consumables, gloves, masks) Negative controls during sampling (e.g., organs of healthy mice during dissection) Collection and storage conditions (21) False-positive and falsenegative samples Use of appropriate storage conditions/buffers; use of unambiguously identified samples; double-checking of tube labeling during sample collection DNA extraction Cross-contamination between samples (22) False-positive samples Rigorous processing (separation of pre-and post-PCR steps, use of sterile hood and filter tips and sterile bacterium-free consumables) Reagent contamination with bacterial DNA (21,23) False-positive samples Negative controls for extraction (extraction without sample) Small amounts of DNA (21,24) False-negative samples Use of an appropriate DNA extraction protocol; discarding of samples with a low DNA concentration Target DNA region and primer design Target DNA region efficacy (19,25) False-negative samples due to poor taxonomic identification Selection of an appropriate target region and design of effective primers for the desired taxonomic resolution Primer design (21,26) False-negative samples due to biases in PCR amplification for some taxa Checking of the universality of the primers with reference sequences Tag/index design and preparation False assignments of sequences due to cross-contamination between tags/indices (27,28) False-positive samples Rigorous processing (use of sterile hood and filter tips and sterile bacterium-free consumables, brief centrifugation before the opening of index storage tubes, separation of pre-and post-PCR steps) Negative controls for tags/indices (empty wells without PCR reagents for particular tags or index combinations) Positive controls for alien DNA, i.e., a bacterial strain highly unlikely to infect the samples studied (e.g., a host-specific bacterium unable to persist in the environment) to estimate false-assignment rate False assignments of sequences due to inappropriate tag/ index design (29) (27) False-positive samples due to false index pairing Use of a single barcode sequence for both the i5 and i7 indices for each sample (when possible, e.g., with a small number of samples) Positive controls for alien DNA, i.e., DNA from a bacterial strain highly unlikely to be found in the rodents studied (e.g., a host-specific bacterium unable to persist in the environment) Bioinformatics and taxonomic classification mizing the impact of these sources of error on HTS data interpretation have yet been reported. We describe here a rigorous experimental design for the direct estimation of biases from the data produced by 16S rRNA amplicon sequencing. We used these bias estimates to control and filter out potential false-positive and false-negative samples during screening for bacterial pathogens. We applied this strategy to 711 commensal rodents collected from 24 villages in Senegal, Western Africa: 208 Mus musculus domesticus, 189 Rattus rattus, 93 Mastomys natalensis, and 221 Mastomys erythroleucus. Pathogenic bacteria associated with the rodents were analyzed using a protocol based on Illumina MiSeq sequencing of the V4 hypervariable region of the 16S rRNA gene (18). We considered the common pitfalls listed in Table 1 during the various stages of the experiment (see details in the workflow procedure in Fig. 1). Biases in assessments of the presence or absence of bacteria in rodents were estimated directly from the data set by including and analyzing negative controls (NC) and positive controls (PC) at various stages of the experiment (see details in Materials and Methods) and by systematically using sample replicates. This strategy delivers realistic and reliable estimates of bacterial prevalence in wildlife populations and could be used to analyze the cooccurrence of different bacterial species within individuals.

RESULTS AND DISCUSSION
Raw sequencing results. The sequencing of 1,569 PCR products in two MiSeq runs generated a total of 23,698,561 raw paired-end sequence reads (251 bp) of the V4 region of the 16S rRNA gene. Because we made PCR replicates for each rodent sample, and because we included several controls in each sequencing run, we had more PCR products (N ϭ 1,569) than rodent samples (N ϭ 711) (see summary in Table S1 in the supplemental material and complete information by sample and run in Table S2). Overall, 99% of the PCRs generated more than 3,000 raw reads (mean, 11,908 reads; standard deviation, 6,062 reads). The raw sequence files have been deposited in FASTQ format in the Dryad Digital Repository (http://dx.doi.org/10.5061/dryad.m3p7d) (41).
Using mothur v1.34 (42) and the MiSeq standard operating procedure (http:// www.mothur.org/wiki/MiSeq_SOP), we removed 20.1% of the paired-end reads because they were misassembled, 1.5% of sequences because they were misaligned, 2.6% because they were chimeric, and 0.2% because they were nonbacterial. The remaining reads were grouped into operational taxonomic units (OTUs) with a divergence threshold of 3%. Bioinformatics analysis identified 13,296 OTUs, corresponding to totals of 7,960,533 sequences in run 1 and 6,687,060 sequences in run 2.
Taxonomic assignment of sequences. We used the Bayesian classifier (bootstrap cutoff ϭ 80%) implemented in mothur with Silva SSU Ref database v119 (42) as a reference for the taxonomic assignment of OTUs. The 50 most abundant OTUs accounted for 89% (minimum, 15,284 sequences; maximum, 2,206,731 sequences) of the total sequence data set (see Table S3 in the supplemental material). The accuracy of taxonomic assignment (to the genus level) was assessed with positive controls for PCR, corresponding to DNA extracts from laboratory isolates of Bartonella taylorii, Borrelia burgdorferi, and Mycoplasma mycoides (PC Bartonella_t , PC Borrelia_b , and PC Mycoplasma_m , respectively), which were correctly assigned to a single OTU corresponding to the appropriate reference sequences ( Table 2). Note that the sequences of PC Mycoplasma_m were assigned to Entomoplasmataceae rather than Mycoplasmataceae because of a  frequent taxonomic error reflected in most databases, including Silva (43). This problem might also affect other taxa. We therefore recommend systematically carrying out a blast analysis against the sequences of taxa of interest in GenBank to confirm the taxonomic assignment obtained with the 16S databases. Finally, we assumed that the small number of sequences per sample might limit the completeness of bacterial detection (36). For this reason, we discarded seven rodent samples (2 M. erythroleucus and 5 M. domesticus) yielding fewer than 500 sequences for at least one of the two PCR replicates (1% of the samples).
Filtering for reagent contaminants. 16S rRNA amplicon sequencing data may be affected by the contamination of reagents (23). We therefore filtered the data, using negative controls for extraction (NC ext ), corresponding to extraction without the addition of a tissue sample, and negative controls for PCR (NC PCR ), corresponding to PCR mixtures to which no DNA was added. We observed between 2,843 and 8,967 sequences in the NC ext and between 5,100 and 9,145 sequences in the NC PCR . On the basis of their high number of reads in negative controls, we identified 13 contaminant genera, including Pseudomonas, Acinetobacter, Herbaspirillum, Streptococcus, Pelomonas, Brevibacterium, Brachybacterium, Dietzia, Brevundimonas, Delftia, Comamonas, Corynebacterium, and Geodermatophilus, some of them having been previously identified in other studies (23). These contaminants accounted for 29% of the sequences in the data set ( Fig. 2). They also differed between MiSeq runs: Pseudomonas, Pelomonas, and Herbaspirillum predominated in run 1, whereas Brevibacterium, Brachybacterium, and Dietzia predominated in run 2 (see Table S4 and Fig. S1 in the supplemental material). This difference probably reflects the use of two different PCR kits manufactured several months apart (Qiagen technical service, personal communication). The majority of the other contaminants, such as Streptococcus, most likely originated from the DNA extraction kits used, as they were detected in abundance in the negative controls for extraction (NC ext ).
Genera identified as contaminants were then simply removed from the sample data set. Note, however, that the exclusion of these results does not rule out the possibility that our samples represented true rodent infections (at least for some of them, such as those by species of the genus Streptococcus, which contains both saprophytic and pathogenic species). However, as mentioned by Razzauti et al. (8), distinguishing between those two possibilities seems difficult, if not impossible. Faced with this lack of certainty, it was most prudent to simply remove these taxa from the sample data set. These results highlight the importance of carrying out systematic negative-control procedures to filter the taxa concerned in order to prevent inappropriate data interpretation, particularly for the Streptococcus genus, which contains a number of important pathogenic species. The use of DNA-free reagents would improve the quality of sequencing data and likely increase the depth of sequencing of the samples.
After filtering for reagent contaminants was performed as described above, 12 OTUs, belonging to 7 genera for which at least one species or one strain is known to be pathogenic in mammals (and that are therefore referenced here as "pathogenic genera"), accounted for 66% of the sequences identified in wild-rodent samples for the two MiSeq runs combined (Fig. 2). These genera are Bartonella, Borrelia, Ehrlichia, Mycoplasma, Orientia, Rickettsia, and Streptobacillus. Six different OTUs were obtained for Mycoplasma (Mycoplasma_OTU_1 to Mycoplasma_OTU_6) and one OTU each for the other genera (Table 2). Finally, the precise significance of the remaining 34% of sequences, which potentially corresponded to commensal bacteria (Bacteroidales, Bacteroides, Enterobacteriaceae, Helicobacter, Lactobacillus), unknown pathogens, undetected contaminants, or undetected sequencing errors, was undetermined.
Filtering for false-positive results. mothur analysis produced a table of abundance, giving the number of sequences for each OTU in each PCR product (data have been deposited in the Dryad Digital Repository [http://dx.doi.org/10.5061/ dryad.m3p7d]) (41). The multiple biases present during the experimental steps and data processing steps listed in Table 1 made it impossible to infer prevalence and cooccurrence directly from the table of sequence presence/absence in the PCR products. We suggest filtering the data with estimates of the different biases calculated from the multiple controls introduced during the process. This strategy involves calculating sequence number thresholds from our bias estimates. Two different thresholds were set for each of the 12 OTUs and two MiSeq runs. We then discarded positive results associated with sequence counts below the threshold (Fig. 1).
Threshold T CC : filtering for cross-contamination. One source of false positives is cross-contamination between samples processed in parallel (Table 1). Negative controls for dissection (NC mus ), consisting of the spleens of healthy laboratory mice manipulated during sessions of wild-rodent dissection, and negative controls for extraction (NC ext ) and PCR (NC PCR ) were used, together with positive controls for PCR (PC Bartonella_t , PC Borrelia_b , and PC Mycoplasma_m ), to estimate levels of crosscontamination. For each sequencing run, we calculated the maximal number of sequences for the 12 pathogenic OTUs in the negative and positive controls. These numbers ranged from 0 to 115 sequences, depending on the OTU and the run considered (Table 2), and we used them to establish values for OTU-specific thresholds of cross-contamination (T CC ) for each run. For example, in sequencing run 2, the highest number of sequences in a control for Mycoplasma_OTU_2 was 115 (in an NC ext ). Therefore, we established the threshold value at 115 sequences for this OTU in sequencing run 2. Thus, PCR products with fewer than 115 sequences for the Myco-plasma_OTU_2 in sequencing run 2 were considered to represent false-positive results for this OTU. The use of these T CC values led to 0% to 69% of the positive results being discarded, corresponding to only 0% to 0.14% of the sequences, depending on the OTU considered ( Fig. 3; see also Table S5 in the supplemental material). A PCR product may be positive for several bacteria in cases of coinfection. In such cases, the use of a T CC makes it possible to discard the positive result for one bacterium while retaining positive results for other bacteria.
Threshold T FA : filtering out incorrectly assigned sequences. Another source of false positives is the incorrect assignment of sequences to a PCR product (Table 1). This phenomenon may be due either to cross-contamination between indices during the experiment or to the generation of mixed clusters during the sequencing (27).
First, the cross-contamination of indices may happen during the preparation of indexed primer microplates. This cross-contamination was estimated using negative-control index pairs (NC index ) corresponding to particular index pairs not used to identify the samples. NC index returned very low (1 to 12) read numbers, suggesting that there was little or no cross-contamination between indices in our experiment.
Second, the occurrence of mixed clusters during the sequencing of multiplexed samples was previously reported by Kircher et al. (27). Mixed clusters on an Illumina flow cell surface are described by Kircher et al. (27) as the predominant source of error of sequence assignment to a PCR product. The impact of this phenomenon on our experiment was estimated using "alien" positive controls (PC alien ) (sequenced in parallel with the rodent samples): PC Mycoplasma_m , corresponding to the DNA of Mycoplasma mycoides, which cannot infect rodents; and PC Borrelia_b , containing the DNA of Borrelia burgdorferi, which is not present in Africa. Neither of these bacterial species can survive in abiotic environments, so the presence of their sequences in African rodent PCR products indicates a sequence assignment error due to false index pairing (27). Using PC Mycoplasma_m , we obtained an estimate of the global false-index-pairing rate of 0.14% (i.e., 398 of 280,151 sequences of the Mycoplasma mycoides OTU were assigned to samples other than PC Mycoplasma_m ). Using PC Borrelia_b , we obtained an estimate of 0.22% (534 of 238,772 sequences of the Borrelia burgdorferi OTU were assigned to samples other than PC Borrelia_b ). These values are very close to the estimate of 0.3% obtained by Kircher et al. (27). Close examination of the distribution of misassigned sequences within the PCR 96-well microplates showed that all PCR products with misassigned sequences had one index in common with either PC Mycoplasma_m or PC Borrelia_b (see Fig. S2 in the supplemental material).
We then estimated the impact of false index pairing for each PCR product by calculating the maximal number of sequences of alien bacteria assigned to PCR products other than the corresponding PC. These numbers ranged from 28 to 43, depending on the positive control for run 1 ( Table 2)-run 2 was discarded because of the low values of the numbers of sequences, which were obtained likely due to the fact that DNAs of PC were diluted 100-fold in run 2 (see Table S1 in the supplemental material). We then estimated a false-assignment rate for each PCR product (R fa ) by dividing the numbers given above by the total number of sequences from alien bacteria in sequencing run 1. R fa was estimated for PC Mycoplasma_m and PC Borrelia_b separately. R fa reached 0.010% and 0.018% for PC Mycoplasma_m and PC Borrelia_b , respectively. We adopted a conservative approach by fixing the R fa value at 0.020%. This number signifies that each PCR product may receive a maximum of 0.020% of the total number of sequences of an OTU present in a run due to false index pairing. Moreover, the number of sequences for a specific OTU misassigned to a PCR product should increase with the total number of sequences of the OTU in the MiSeq run. We therefore defined the second threshold (T FA ) as the total number of sequences in the run for an OTU multiplied by R fa . T FA values varied with the abundance of each OTU in the sequencing run (Table 2). Because the abundance of each OTU varied from one sequencing run to the next, T FA also varied according to the sequencing run. The use of the T FA led to 0% to 87% of positive results being discarded. This corresponded to 0% to 0.71% of the sequences, depending on the OTU (Fig. 3; see also Table S5 in the supplemental material).
Validation performed with PCR replicates. Random contamination may occur during the preparation of PCR 96-well microplates. These contaminants may affect some of the wells, but not those for the negative controls, leading to the generation of false-positive results. We thus adopted a conservative approach in which we considered rodents to be positive for a given OTU only if both PCR replicates were considered to represent positive results after the filtering steps described above were performed. The relevance of this strategy was supported by the strong correlation between the numbers of sequences for the two PCR replicates for each rodent (R 2 Ͼ 0.90) (Fig. 4; see also Fig. S3 in the supplemental material). At this stage, 673 positive results for 419 rodents were validated for both replicates (note that a rodent sample may be positive for several bacterial species and may thus be counted several times), whereas only 52 positive results were discarded because the result for the other replicate was negative. At this final validation step, 0% to 60% of the positive results for a given OTU were discarded, corresponding to only 0% to 7.17% of the sequences ( Fig. 3; see also Table S5 and Table S6 in the supplemental material). Note that the number of replicates may be increased, as previously described for the strategy of Gómez-Díaz et al (44).
Postfiltering results. Finally, the proportions of rodents positive for a given OTU filtered out by the complete filtering approach ranged from 6% to 86%, depending on the OTU, corresponding to only 1% of the total sequences (Fig. 3). Indeed, our filtering strategy mostly excluded rodents with a small number of sequences for the OTU concerned. These rodents were considered to represent false-positive results.
Refining bacterial taxonomic identification. We refined the taxonomic identification of the 12 bacterial OTUs through phylogenetic and blast analyses. We were able to identify the bacteria present down to the genus level, and, in some cases, we could even identify the most likely species (Table 3; 45) for OTU_2-with high percentages of sequence identity (Ն93%) and bootstrap values (Ն80%). All three of these species belong to the Hemoplasma group, whose members are known to infect mice, rats, and other mammals (46,47) and are thought to cause anemia in humans (48,49). The Borrelia sequences grouped with three different species of the relapsing fever group (B. crocidurae, B. duttonii, and B. recurrentis) with a high percentage of identity (100%) and   (50). The Ehrlichia sequences were 100% identical to and clustered with the sequences of the recently described "Candidatus Ehrlichia khabarensis" species isolated from voles and shrews in the Far East of the Russian Federation (51). The Rickettsia sequences were 100% identical to the sequence of R. typhi, a species of the typhus group responsible for murine typhus (52), but species of this clade were differentiated from many other Rickettsia species only weakly (bootstrap support of 61%). The most likely species corresponding to the sequences of the Streptobacillus OTU was S. moniliformis, with a high percentage of identity (100%) and a bootstrap value of 100%. This bacterium is common in rats and mice and causes a form of rat-bite fever, Haverhill fever (53). The Orientia sequences corresponded to O. chuto, with a high percentage of identity (100%) and a bootstrap value of 77%. This species was recently isolated from a patient infected in Dubai (54). Finally, accurate species determination was not possible for Bartonella, as the 16S rRNA gene does not resolve the species of this genus well (55). Indeed, the sequences from the Bartonella OTU detected in our rodents corresponded to at least seven different species (B. elizabethae, B. japonica, B. pachyuromydis, B. queenslandis, B. rattaustraliani, Streptobacillus is a genus of aerobic, Gramnegative facultative anaerobe bacteria, which grow in culture as rods in chains. Streptobacillus moniliformis is common in rats and mice and is responsible of the streptobacillosis form of rat-bite fever, the Haverhill fever. This zoonosis begins with high prostrating fevers, rigors (shivering), headache, and polyarthralgia (joint pain). Left untreated, rat-bite fever has a mortality rate of approximately 10% (53). B. tribocorum, and B. vinsonii) and a putative new species recently identified in Senegalese rodents (56). These findings demonstrate the considerable potential of 16S rRNA amplicon sequencing for the rapid identification of zoonotic agents in wildlife, provided that the postsequencing data are cleaned beforehand. Borrelia (50) and Bartonella (56) were the only ones of the seven pathogenic bacterial genera detected here in Senegalese rodents to have been reported as present in rodents from West Africa before. The other bacterial genera identified here have previously been reported to be present in rodents only in other parts of Africa or on other continents. Streptobacillus moniliformis has recently been detected in rodents from South Africa (57), and there have been a few reports of human streptobacillosis in Kenya (58) and Nigeria (59). R. typhi was recently detected in rats found in Zaire, in Central Africa (60), and human seropositivity for this bacterium has been reported in coastal regions of West Africa (61). With the exception of one study in Egypt published some time ago (62), Mycoplasma spp. have never before been reported in African rodents. Several species of Ehrlichia (from the E. canis group: E. chaffeensis, E. ruminantium, E. muris, and E. ewingii) have been characterized in West Africa, but only in ticks from cattle (63), and a report of possible cases of human ehrlichioses in this region was previously published (64). Finally, the present study reports the first identification of Orientia in African rodents (9). There have already been a few reports of suspected human infection with this bacterium in Zaire, Cameroon, Kenya, and Tanzania (65).
Estimating prevalence and coinfection. After performing data filtering, we were able to estimate the prevalence in rodent populations and to assess coinfection in individual rodents for the 12 bacterial OTUs. Rates of bacterial prevalence differed considerably between rodent species (Table 3). Bartonella spp. were highly prevalent in the two multimammate rat species M. natalensis (79%) and M. erythroleucus (27%); Orientia spp. were prevalent in the house mouse species M. musculus (22%); and Ehrlichia spp. occurred frequently in only one of the two rats of the multimammate species M. erythroleucus (18%). In contrast, the prevalence of Streptobacillus and Rickettsia was low in all rodent species (Ͻ5%). Coinfection was common, as 184 rodents (26%) were found to be coinfected with bacteria from two (19%), three (5%), four (2%), or five (0.1%) different bacterial pathogens.
Interestingly, several Mycoplasma OTUs appeared to be specific to a rodent genus or species (Table 3 and Fig. 5A). OTU_2, putatively identified as a recently described lineage isolated from brown rat, Rattus norvegicus (45), was specifically associated with R. rattus in this study. Of the OTUs related to M. coccoides, OTU_4 was found exclusively in R. rattus, whereas OTU_5 and OTU_6 seemed to be specific to the two multimammate rats (M. erythroleucus and M. natalensis). Comparative phylogenies of Mycoplasma OTUs and rodents showed that R. rattus, which is phylogenetically more distantly related to the other three rodents, contained a Mycoplasma community different from that in the rodents in the Mus-Mastomys clade (Fig. 5A). Pathogen prevalences also differed considerably between sites, as shown for the six Mycoplasma OTUs (Fig. 5B). This suggests that the infection risks for animals and humans vary greatly according to environmental characteristics and/or biotic features potentially related to recent changes in the distribution of rodent species in Senegal (66,67).

Perspectives. (i) Recommendation for future experiments.
Our experiments demonstrated the need to include many different kinds of controls, at different steps, in order to avoid data misinterpretation. In particular, alien positive controls are important for establishing threshold values for OTU positivity. These alien positive controls should include taxa that are distant enough from those of potential pathogens to avoid potential confusion between sequences of alien controls and sequences that result from actual infection of rodent samples. Ideally, one should choose alien positive controls from bacterial genera which are not able to infect the study's host species. In our study, the use of Mycoplasma and Borrelia species as alien positive controls was not ideal because both genera are potential rodent pathogens. Thankfully, the species used as alien controls could be easily distinguished from the species found in rodents on the basis of the phylogenetic analyses of the V4 sequences. However, on the basis of our experience, we recommend using bacterial genera phylogenetically distant from pathogenic genera as alien controls when possible.
The inclusion of negative controls of PCR and DNA extraction in studies based on massive sequencing of 16S rRNA amplicons had long been overlooked until a report published by Salter in 2014 (23) demonstrated the high pollution of laboratory reagents by bacterial DNA. Most studies published prior to that report reported no extraction controls in their protocols. Here, we performed one negative control for extraction per DNA extraction microplate; with each run consisting of four DNA extraction microplates, and each control having been analyzed in two replicate experiments, we had a total of 8 negative controls for extraction per run which were analyzed twice. On the basis of our experience, we recommend performing at least this number of extraction controls per run. Further increases in the number of extraction controls per microplate would further improve the efficiency of data filtering and so the quality of the data produced.
The protocol of PCR amplification is also of importance for ensuring data quality. In our study, we built separate amplicon libraries for each sample separately and used very long (5-min) PCR elongation times in order to mitigate the formation of chimeric reads (18). High numbers of PCR cycles are also known to increase chimera formation, yet, as mentioned by Schnell et al. (68), this parameter is mainly critical only when bulk amplification of pools of tagged/indexed amplicons is performed (e.g., when using an Illumina TrueSeq library preparation kit). As we used separate amplicon libraries for each sample, we believe that the relatively high number of PCR cycles that we used (40 cycles) had minimal impact on chimera formation and that this protocol ensures the absence of chimeric sequences between samples. We had chosen to maximize the number of cycles to enhance our ability to detect pathogenic bacteria, which are sometimes of low quantity in animal samples. Fine-tuning the balance between these parameters deserves further study.
Moreover, we targeted the spleen to detect bacterial infections in our study based on the fact that this organ is known to filter microbial cells in mammals. However, we lack the data to be certain that the spleen is the best organ for use in giving a global picture of bacterial infection in rodents (and, more broadly, in vertebrates). We are currently conducting new experiments to address this issue.
Finally, in our experiments, about a third of OTU sequences were attributable neither to contamination nor to (known) pathogenic genera. We currently have no precise idea of the significance of the presence of these OTUs in the rodent spleens. Some of these OTUs could be linked to further undetected biases during data generation; in spite of all the precautions we have implemented here, other biases may still elude detection. Such biases could explain the very high numbers of rare OTUs (11,947 OTUs corresponding to Ͻ100 reads), which together represent more than 88% of the total number of OTUs but less than 1% of the total number of sequences (with the two runs combined).
Additionally, the presence of an OTU in a rodent spleen does not necessarily imply that the OTU is pathogenic. We know little about the microbiome of healthy organs of vertebrates, and yet the sharp increase in the number of microbiome studies over the last few years has led to the discovery that communities of microbiota appear to be specific to each part of the vertebrate's body, including internal tissues and blood (69). The OTUs detected in rodent's spleen could thus simply be part of the healthy microbiome of the organ. These issues deserve better documentation. Our results thus pave the way for future research on unknown bacterial pathogens and on the microbiome of healthy organs in vertebrates.
(ii) Improving HTS for epidemiological surveillance. The screening strategy described here has the considerable advantage of being nonspecific, making it possible to detect unanticipated or novel bacteria. Razzauti et al. (8) recently showed that the sensitivity of 16S rRNA amplicon sequencing on the MiSeq platform was equivalent to that of whole-RNA sequencing (RNA-seq) on the HiSeq platform for detecting bacteria in rodent samples. However, little is known about the comparative sensitivities of HTS approaches relative to the sensitivity of quantitative PCR (qPCR) performed with specific primers, the current gold standard for bacterial detection within biological samples. Additional studies are required to address this issue. Moreover, as 16S rRNA amplicon sequencing is based on a short sequence, it does not yield results sufficiently high in resolution to distinguish between species in some bacterial genera, such as Bartonella, or to distinguish between pathogenic and nonpathogenic strains within the same bacterial species. To get this information, we thus need to follow up the 16S rRNA amplicon sequencing with complementary laboratory work. Whole-genome shotgun or RNA-seq techniques provide longer sequences, through the production of longer reads or the assembly of contigs, and they might therefore increase the accuracy of species detection (70). However, these techniques would be harder to adapt for the extensive multiplexing of samples (8). Other methods could be used to assign sequences to bacterial species or strains for samples found to be positive for a bacterial genus following the 16S rRNA screening. For example, positive PCR assays could be carried out with bacterial genus-specific primers, followed by amplicon sequencing, as commonly used in multilocus sequence analysis (MLSA) strategies (71), or high-throughput microfluidic qPCR assays based on bacterial species-specific primers could be used (72). High-throughput amplicon sequencing approaches could be fine-tuned to amplify several genes for species-level assign-ment, such as the gltA gene used by Gutiérrez et al. (73) for the Bartonella genus, in parallel with the 16S rRNA-V4 region.
This strategy could also easily be adapted for other microbes, such as protists, fungi, and even viruses, provided that universal primers are available for their detection (see references 74 and 75 for protists and fungi and reference 76 for degenerate virus family-level primers for viruses). Finally, our filtering method could also be translated to any other postsequencing data set of indexed or tagged amplicons in the framework of environmental studies (e.g., metabarcoding for diet analysis and biodiversity monitoring [77], the detection of rare somatic mutations [78], or the genotyping of highly polymorphic genes [e.g., major histocompatibility complex {MHC} or HLA typing] [79,80]).
(iii) Monitoring the risk of zoonotic diseases. Highly successful synanthropic wildlife species, such as the rodents studied here, will probably play an increasingly important role in the transmission of zoonotic diseases (81). Many rodent-borne pathogens cause only mild or undifferentiated disease in healthy people, and these illnesses are often misdiagnosed and underreported (53,(82)(83)(84)(85). The information about pathogen circulation and transmission risks in West Africa provided by this study is important in terms of human health policy. We show that rodents carry species of seven major pathogenic bacterial genera: Borrelia, Bartonella, Mycoplasma, Ehrlichia, Rickettsia, Streptobacillus, and Orientia. The last five of these genera have never before been reported in West African rodents. The data generated with our HTS approach could also be used to assess zoonotic risks and to formulate appropriate public health strategies involving the focusing of continued pathogen surveillance and disease-monitoring programs on specific geographic areas or rodent species likely to be involved in zoonotic pathogen circulation, for example. Experimental controls. Recent research has highlighted different biases occurring at different steps of high-throughput sequencing. These biases can be estimated directly from the data by including several controls together with samples in the experiment. We detail below these different controls as well as the rationale for their use.

MATERIALS AND METHODS
(i) Negative controls for sample collection. When possible, we advise inclusion of axenic samples during collection of the samples. The numbers of sequences observed in these controls are used to estimate cross-contamination rates during sample collection. In our study, we used spleens from healthy laboratory mice (NC mus ), free from rodent pathogens, which were manipulated together with wild-rodent samples during the dissections in the field.
(ii) Negative controls for DNA extraction (NC ext ). DNA extractions should be performed without the addition of sample tissue (blanks), which are processed together with the other samples. We advise performing at least one extraction blank experiment for each extraction experiment, although more would be better. The numbers of sequences observed in these controls are used to estimate and filter the cross-contaminations during the DNA extractions and to detect any DNA bacterial contaminants present in the extraction kit reagents.
(iii) Negative controls for PCR (NC PCR ). PCRs should be performed without any DNA extract included (blank), which should be processed together with the other samples. We advise performing at least one PCR blank experiment per PCR microplate, although more would be better. The numbers of sequences observed in these controls are used to estimate and filter the cross-contaminations during the PCR preparation and to detect any DNA bacterial contaminants present in the PCR reagents.
(iv) Negative controls for indexing (NC index ). Combinations of barcodes that are not used to identify samples in the sequencing run should be searched for during the bioinformatic demultiplexing. In practice, they should correspond to empty PCR wells (without reagent and without index). The numbers of sequences recovered for these particular index combinations should be used to estimate and filter the cross-contaminations between indexed PCR primers during primer handling or PCR preparation and to identify errors in an Illumina sample sheet.
(v) Positive controls for PCR (PC PCR ). DNA from isolates of known taxa should be processed together with the other samples. The sequences obtained for these controls should be used to verify the taxonomic assignment and to estimate and filter cross-contaminations.
(vi) Positive controls for indexing (PC alien ). DNA from isolates of known taxa should be absent from the samples. They should be handled separately from the samples to avoid cross-contaminations with the samples during the wet laboratory procedures (DNA extractions and PCRs). Sequences from PC alien found in the samples are used to calculate the rate of sample misidentification due to false index pairing (see text and Kircher et al. [27] for details concerning this phenomenon).
In practice, PC PCR and PC alien could be the same, and we advise the use of taxa that are phylogenetically distant from the taxa of interest in order to avoid potential confusion between sequences from alien controls and sequences from the samples.
Sample collection. We sampled rodents in 24 villages of the Sahelian and Sudanian climatic and biogeographical zones in Senegal (see Dalecky et al. [67] for details on the geographic location and other information on the villages). Rodents were sampled by live trapping according to the standardized protocol described by Dalecky et al. (67). Briefly, traps were set within homes (with one single-capture wire-mesh trap and one Sherman folding box trap used per room) during one day to five consecutive days. Each captured rodent was collected alive and transported to the field laboratory. There, rodents were killed by cervical dislocation as recommended by Mills et al. (86) and dissected as described by Herbreteau et al. (87). Rodent species were identified by morphological and/or molecular techniques (67). The information concerning the rodent collection (sample identifier [ID], locality, and species) is provided in Table S2 in the supplemental material. Cross-contamination during dissection was prevented by washing the tools used successively in bleach, water, and alcohol between rodent procedures. We used the spleen for bacterial detection, because this organ is a crucial site of early exposure to bacteria (88). Spleens were placed in RNAlater (Sigma) and stored at 4°C for 24 h and then at Ϫ20°C until their use for genetic analyses.
Target DNA region and primer design. We used primers with sequences slightly modified from those of the universal primers described by Kozich  We used a slightly modified version of the dual-index method of Kozich et al. (18) to multiplex our samples. The V4 primers included different 8-bp indices (called the i5 index in the forward primer and the i7 index in the reverse primer) and Illumina adapters (called the P5 adapter in the forward primer and the P7 adapter in the reverse primer) in the 5= position. The combinations of 24 i5-indexed primers and 36 i7-indexed primers made it possible to identify 864 different PCR products loaded onto the same MiSeq flow cell. Each index sequence differed from the others by at least two nucleotides, and each nucleotide position in the sets of indices contained approximately 25% of each base, to prevent problems due to Illumina low-diversity libraries (Table 1).
DNA extraction and PCRs. All pre-PCR laboratory manipulations were conducted with filter tips under a sterile hood in a DNA-free room, i.e., a room dedicated to the preparation of PCR mix and equipped with hoods that are kept free of DNA by UV irradiation and bleach treatment. DNA from bacterial isolates (corresponding to DNA extracts from laboratory isolates of Bartonella taylorii, Borrelia burgdorferi, and Mycoplasma mycoides) was extracted in another laboratory, and PCRs using these isolates were performed after the amplifications of the DNA from rodents to avoid crosscontamination between samples and bacterial isolates. DNA was extracted with a DNeasy 96 tissue kit (Qiagen), with final elution in 200 l of elution buffer. One extraction blank (NC ext ), corresponding to an extraction without sample tissue, was systematically added to each of the eight DNA extraction microplates. DNA was quantified with a NanoDrop 8000 spectrophotometer (Thermo Scientific) to confirm the presence of a minimum of 10 ng/l of DNA in each sample. DNA amplification was performed in 5 l of Multiplex PCR kit (Qiagen) master mix, with 4 l of combined i5 and i7 primers (3.5 M) and 2 l of genomic DNA. The PCR began with an initial denaturation at 95°C for 15 min; followed by 40 cycles of denaturation at 95°C for 20 s, annealing at 55°C for 15 s, and extension at 72°C for 5 min; followed by a final extension step at 72°C for 10 min. PCR products (3 l) were verified by electrophoresis in a 1.5% agarose gel. One PCR blank (NC PCR ), corresponding to the PCR mix with no DNA, was systematically added to each of the 18 PCR microplates. DNA was amplified in replicate for all wild-rodent samples (n ϭ 711) (for a summary, see Table S1 in the supplemental material; for details by sample, see Table S2).
Library preparation and MiSeq sequencing. Two Illumina MiSeq runs were conducted. Run 1 included the PCR products (two or three replicates per sample) from wild rodents collected in north Senegal (148 Mastomys erythroleucus and 207 Mus musculus) plus the positive controls and the negative controls. Run 2 included the PCR products (two replicates per samples) from wild rodents collected in south Senegal (73 Mastomys erythroleucus, 93 Mastomys natalensis, and 190 Rattus rattus) plus the positive controls and the negative controls. Full details on the composition of runs are given in Table S2 in the supplemental material. The MiSeq platform was chosen because it generates lower error rates than other HTS platforms (90). The numbers of PCR products multiplexed were 823 for the first MiSeq run and 746 for the second MiSeq run (see Table S2). Additional PCR products from other projects were added to give a total of 864 PCR products per run. PCR products were pooled by volume for each 96-well PCR microplate, using 4 l for rodents and controls and 1.5 l for bacterial isolates. Mixes were checked by electrophoresis on 1.5% agarose gels before their use to generate a "superpool" of 864 PCR products for each MiSeq run. We subjected 100 l of each superpool to size selection for the full-length amplicon (expected median sizes of V4 hypervariable region, 375 bp [including primers, indices, and adaptors] and 251 bp [excluding primers, indices, and adaptors]) by excision from a low-melting (1.25%) agarose gel to discard nonspecific amplicons and primer dimers. A PCR cleanup gel extraction kit (Macherey-Nagel) was used to purify the excised bands. DNA was quantified by the use of a Kapa library quantification kit (Kapa Biosystems) for the final library before loading on a MiSeq (Illumina) flow cell (expected cluster density, 700,000 to 800,000/mm 2 ) was performed with reagent kit v2 (Illumina) (500 cycles). We performed two runs of 251-bp paired-end sequencing, which yielded high-quality sequencing through the reading of each nucleotide of the V4 fragments twice after the assembly of read 1 and read 2. The raw sequence reads (.fastq format) have been deposited in the Dryad Digital Repository http://dx.doi.org/10.5061/dryad.m3p7d (41).
Bioinformatic and taxonomic classification. MiSeq datasets were processed with mothur v1.34 (42) and with the MiSeq standard operating procedure (SOP) (18). Briefly, the MiSeq SOP (http:// www.mothur.org/wiki/MiSeq_SOP) allowed us to (i) construct contigs of paired-end read 1 and read 2 using the make.Contig command; (ii) remove the reads with poor quality of assembly (Ͼ275 bp); (iii) align unique sequences using Silva SSU Reference alignment v119 (89); (iv) remove the misaligned, nonspecific (eukaryotic), and chimeric reads (uchime program); (v) regroup the reads into operational taxonomic units (OTUs) with a 3% divergence threshold; and (vi) classify the OTUs using the Bayesian classifier included in mothur (bootstrap cutoff, 80%) and the Silva taxonomic file. At the end of the process, we obtained a table giving the number of reads for each OTU in each line and each PCR product in each column. For each OTU, the taxonomic classification (up to genus level) was provided. The abundance table generated by mothur for each PCR product and each OTU was filtered as described in Results and Discussion. The most abundant sequence for each OTU in each sample was extracted from the sequence data set with a custom-written Perl script (deposited in the Dryad Digital Repository [http://dx.doi.org/10.5061/dryad.m3p7d]) (41). The most abundant sequences for the 12 OTUs were deposited in GenBank (see below). The sequences were aligned with reference sequences from bacteria of the same genus available from Silva SSU Ref NR database v119 using SeaView v4 (91). We used a neighbor-joining method (bioNJ) to produce phylogenetic trees with a Kimura 2-parameter model using SeaView software, and species were identified on the basis of the "closest phylogenetic species." We also used our sequences for blast analyses of GenBank (BLASTn against nucleotide collection [nr/nt] performed in January 2016) to identify the reference sequences to which they displayed the highest percentage of identity. The raw abundance table, the mothur command lines, the mothur output files, the Perl script, and the FASTA files used for the phylogenetic analyses have been deposited in the Dryad Digital Repository (http://dx.doi.org/10.5061/dryad.m3p7d) (41).
Accession numbers. The most abundant sequences for the 12 OTUs are available in GenBank (accession numbers KU697337 to KU697350).

ACKNOWLEDGMENTS
The study was conceived and designed by M.G. and J. We thank Virginie Dupuy for extracting DNA from bacterial cultures as well as Julie Sappa from Alex Edelman and Associates, Jessie L. Abbate, and Petra Villette for improving the English writing. Analyses were performed on the CBGP HPC computational platform.