Skip to main content
  • ASM Journals
    • Antimicrobial Agents and Chemotherapy
    • Applied and Environmental Microbiology
    • Clinical Microbiology Reviews
    • Clinical and Vaccine Immunology
    • EcoSal Plus
    • Infection and Immunity
    • Journal of Bacteriology
    • Journal of Clinical Microbiology
    • Journal of Microbiology & Biology Education
    • Journal of Virology
    • mBio
    • Microbiology and Molecular Biology Reviews
    • Microbiology Resource Announcements
    • Microbiology Spectrum
    • Molecular and Cellular Biology
    • mSphere
    • mSystems
  • Log in
  • My alerts
  • My Cart

Main menu

  • Home
  • Articles
    • Latest Articles
    • Special Issues
    • COVID-19 Special Collection
    • Editor's Picks
    • Special Series: Sponsored Minireviews and Video Abstracts
    • Archive
  • Topics
    • Applied and Environmental Science
    • Ecological and Evolutionary Science
    • Host-Microbe Biology
    • Molecular Biology and Physiology
    • Novel Systems Biology Techniques
    • Early-Career Systems Microbiology Perspectives
  • For Authors
    • Getting Started
    • Submit a Manuscript
    • Scope
    • Editorial Policy
    • Submission, Review, & Publication Processes
    • Organization and Format
    • Errata, Author Corrections, Retractions
    • Illustrations and Tables
    • Nomenclature
    • Abbreviations and Conventions
    • Publication Fees
    • Ethics
  • About the Journal
    • About mSystems
    • Editor in Chief
    • Board of Editors
    • For Reviewers
    • For the Media
    • For Librarians
    • For Advertisers
    • Alerts
    • RSS
    • FAQ
  • ASM Journals
    • Antimicrobial Agents and Chemotherapy
    • Applied and Environmental Microbiology
    • Clinical Microbiology Reviews
    • Clinical and Vaccine Immunology
    • EcoSal Plus
    • Infection and Immunity
    • Journal of Bacteriology
    • Journal of Clinical Microbiology
    • Journal of Microbiology & Biology Education
    • Journal of Virology
    • mBio
    • Microbiology and Molecular Biology Reviews
    • Microbiology Resource Announcements
    • Microbiology Spectrum
    • Molecular and Cellular Biology
    • mSphere
    • mSystems

User menu

  • Log in
  • My alerts
  • My Cart

Search

  • Advanced search
mSystems
publisher-logosite-logo

Advanced Search

  • Home
  • Articles
    • Latest Articles
    • Special Issues
    • COVID-19 Special Collection
    • Editor's Picks
    • Special Series: Sponsored Minireviews and Video Abstracts
    • Archive
  • Topics
    • Applied and Environmental Science
    • Ecological and Evolutionary Science
    • Host-Microbe Biology
    • Molecular Biology and Physiology
    • Novel Systems Biology Techniques
    • Early-Career Systems Microbiology Perspectives
  • For Authors
    • Getting Started
    • Submit a Manuscript
    • Scope
    • Editorial Policy
    • Submission, Review, & Publication Processes
    • Organization and Format
    • Errata, Author Corrections, Retractions
    • Illustrations and Tables
    • Nomenclature
    • Abbreviations and Conventions
    • Publication Fees
    • Ethics
  • About the Journal
    • About mSystems
    • Editor in Chief
    • Board of Editors
    • For Reviewers
    • For the Media
    • For Librarians
    • For Advertisers
    • Alerts
    • RSS
    • FAQ
Research Article | Applied and Environmental Science

FORENSIC: an Online Platform for Fecal Source Identification

Adélaïde Roguet, Özcan C. Esen, A. Murat Eren, Ryan J. Newton, Sandra L. McLellan
Sarah M. Allard, Editor
Adélaïde Roguet
aSchool of Freshwater Sciences, University of Wisconsin–Milwaukee, Milwaukee, Wisconsin, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Adélaïde Roguet
Özcan C. Esen
bDepartment of Medicine, University of Chicago, Chicago, Illinois, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
A. Murat Eren
bDepartment of Medicine, University of Chicago, Chicago, Illinois, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for A. Murat Eren
Ryan J. Newton
aSchool of Freshwater Sciences, University of Wisconsin–Milwaukee, Milwaukee, Wisconsin, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Ryan J. Newton
Sandra L. McLellan
aSchool of Freshwater Sciences, University of Wisconsin–Milwaukee, Milwaukee, Wisconsin, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Sandra L. McLellan
Sarah M. Allard
University of California San Diego
Roles: Editor
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
DOI: 10.1128/mSystems.00869-19
  • Article
  • Figures & Data
  • Info & Metrics
  • PDF
Loading

ABSTRACT

Sewage overflows, agricultural runoff, and stormwater discharges introduce fecal pollution into surface waters. Distinguishing these sources is critical for evaluating water quality and formulating remediation strategies. With the falling costs of sequencing, microbial community-based water quality assessment tools are under development. However, their application is limited by the need to build reference libraries, which requires extensive sampling of sources and bioinformatic expertise. Here, we introduce FORest Enteric Source IdentifiCation (FORENSIC; https://forensic.sfs.uwm.edu/), an online, library-independent source tracking platform based on random forest classification and 16S rRNA gene amplicon sequences to identify in environmental samples common fecal contamination sources, including humans, domestic pets, and agricultural animals. FORENSIC relies on a broad reference signature database of Bacteroidales and Clostridiales, two predominant bacterial groups that have coevolved with their hosts. As a result, these groups demonstrate cohesive and reliable assemblage patterns within mammalian species or among species sharing the same diet/physiology. We created a scalable and extensible platform that we tested for global applicability using samples collected in distant geographic locations. This Web application offers a fast and intuitive approach for fecal source identification, particularly in sewage-contaminated waters.

IMPORTANCE FORENSIC is an online platform to identify sources of fecal pollution without the need to create reference libraries. FORENSIC is based on the ability of random forest classification to extract cohesive source microbial signatures to create classifiers despite individual variability and to detect the signatures in environmental samples. We primarily focused on defining sewage signals, which are associated with a high human health risk in polluted waters. To test for fecal contamination sources, the platform only requires paired-end reads targeting the V4 or V6 regions of the 16S rRNA gene. We demonstrated that we could use V4V5 reads trimmed to the V4 positions to generate the reference signature. The systematic workflow we describe to create and validate the signatures could be applied to many disciplines. With the increasing gap between advancing technology and practical applications, this platform makes sequence-based water quality assessments accessible to the public health and water resource communities.

INTRODUCTION

Environmental fecal pollution is recognized worldwide as a major threat to human health that causes millions of deaths in developing countries (1). While the waterborne disease burden is far less in developed countries, chronic fecal pollution is regularly reported (2) and leads to waterborne illness, diminished ecosystem services, and economic loss (3). In urbanized watersheds, fecal contamination can originate from multiple sources, including failing sewage infrastructure and contaminated runoff containing waste from upstream livestock, pets, and wildlife. However, gastrointestinal illness risks may differ according to the pollution source. For example, human exposure to waters contaminated by sewage or cattle waste is estimated to present a greater health risk than similar exposure to avian waste (4, 5). Water quality measures based on traditional fecal indicator bacteria do not identify pollution sources (6). As a result, more advanced methods, including quantitative PCR, have been developed to distinguish fecal contamination sources by targeting marker sequences that are indicative of a specific pollution source (7).

The ever-decreasing cost of sequencing and the development of advanced computational tools has made high-throughput sequencing one of the most efficient ways to study bacterial communities, and it shows promise for microbial source tracking applications (8, 9). Recent studies have used 16S rRNA gene amplicons to successfully characterize human and animal fecal sources in surface waters (10–16). These studies rely on the detection of fecal community signatures shaped by coevolutionary dynamics between hosts and their gut microbiota (17–20). These approaches, including use of the state-of-the-art Bayesian classifier SourceTracker (21), require the creation of a reference sequence library for fecal samples, which entails sampling and sequencing both of source samples and of the water samples of interest. Such additional effort is often beyond the capabilities or resources available to a single investigator. This limitation is compounded when fecal references are geographically distant from samples tested, which can alter the accuracy of source identification (14). To date, a publicly available database that encompasses most common animal sources for fecal contamination does not exist.

Our previous work reported a source tracking method using random forest classification, which offered a rapid, sensitive, and accurate solution for identifying host-microbial signatures in environmental samples (15). In this study, we demonstrated that the two most common gut-associated bacterial orders in mammals, i.e., Bacteroidales and Clostridiales (18, 22), provide enough host-specific signal to perform accurate source identification. Similar observations were reported at the genus level (23–25), highlighting the fractal nature of specialization between gut microbiota and their hosts. Targeting taxonomically defined groups, rather than studying the whole bacterial community, may reduce the influence of large cross-phylum shifts due to diet or sequencing primer bias in the bacterial community (25) while providing relevant host-associated profiles.

Here, we introduce the Web application FORest Enteric Source IdentifiCation (FORENSIC), an online platform with a user-friendly interface that employs random forest to identify common sources responsible for fecal contamination. FORENSIC gives access to an extensible list of human and animal reference signatures, eliminating the need for users to create their own fecal reference databases. This platform responds to an emerging demand for sequence data in the advancement of water quality testing.

RESULTS

Fecal gut microbiota were explored in 132 human (sewage) and 395 animal fecal source samples collected around the world, which included but not were not limited to those from cats, cattle, dogs, deer, pigs, horses, goats, sheep, rabbits, raccoons, chicken, and geese. Amplicon sequences generated from the V4, V4V5, and V6 regions of the 16S rRNA gene were used to create and validate global FORENSIC source signatures.

Host community patterns and performance of global classifiers.We focused on Bacteroidales and Clostridiales to perform fecal source identification, since they are the most dominant fecal microbiota and have the highest genetic diversity among mammal fecal samples that we examined (Fig. 1). Despite microdiversity and geographical patterns within some sources, both taxonomic groups provided resolution to observe source-specific structures in their assemblages (Fig. 2). This structure allowed us to create and store fecal source signatures as classifiers for both bacterial orders and multiple 16S rRNA gene variable regions. A sample was considered contaminated by a source if random forest classification could identify a bacterial structure similar to the fecal signature stored in the classifier.

FIG 1
  • Open in new tab
  • Download powerpoint
FIG 1

Phylogeny and distribution of the dominant V4 sequences across the fecal microbiota of eight host species. Ten fecal samples were averaged per host, except for dog, chicken, and goose, represented by 7, 6, and 8 samples, respectively. Only sequences (n = 287) with an average relative abundance per host of 0.5% are displayed. Colors in the inner circle depict phyla. Gray clades symbolize the dominant bacterial orders for at least one host. Log-transformed relative abundances were normalized by the maximum abundance for each host. Fuso., Fusobacteriales; Lacto., Lactobacillales. The tree was rooted using Halobellus ramosii strain S2FP14 (GenBank accession no. NR_145608.1). Sequences were aligned using MUSCLE implemented in MEGA (61). The tree was generated using the interactive Tree Of Life (iTOL) (62).

FIG 2
  • Open in new tab
  • Download powerpoint
FIG 2

Bray-Curtis analysis of the Bacteroidales assemblage for the V4 region of the 16S rRNA gene. Color code shows the host, the type of sample (train or test the classifier), the initial 16S rRNA gene region amplified, and the geographical origin of the samples. Samples were ordered by source from the most similar samples (left) to the most dissimilar (right) based on the averaged intrasource Bray-Curtis dissimilarity comparisons. h., horse. Bray-Curtis analysis of the V4 Clostridiales and V6 Bacteroidales and Clostridiales assemblages are presented in Fig. S2 and S3 in the supplemental material, respectively.

To create the initial classifiers, we used the 15 to 20 samples with the most similar Bacteroidales or Clostridiales sequence structure (based on Bray-Curtis intrasource comparisons) for the V4V5 and V6 variable regions. We then tested the specificity (i.e., proportion of correctly classified negative samples) and sensitivity (i.e., proportion of correctly classified positive samples) of these classifiers against samples with the most dissimilar Bacteroidales or Clostridiales sequence structure. Exceptions were applied when a strong community bifurcation was observed within an animal source assemblage (i.e., for V6 cattle and dog; see Materials and Methods). We maximized the performance of each classifier by tailoring the voting tree probability cutoff (called the decision cutoff) at which a sample was considered contaminated or not by a source (see Materials and Methods); the voting tree probability is a proxy reflecting the bacterial profile similarity between a tested sample and a fecal signature. The performances of classifiers and decision cutoffs are summarized in Table 1. The classifier amplicon sequence variants (ASVs), which are the highly resolved operational taxonomic units composing the individual source signatures, are listed in Table S1 in the supplemental material. All V4V5 and V6 sewage classifiers had a specificity of 100%. All sewage samples were correctly classified. All animal classifiers had a specificity ranging from 82 to 100%, except V4V5 dog Bacteroidales, for which no cutoff could be defined; this classifier was therefore discarded. Misclassifications typically involved animals with similar diets, e.g., misclassifications of horse as cattle (see Data Set S1 in the supplemental material for individual sample predictions). The fecal sources with the lowest sensitivity, e.g., dog or cattle, had a large intrasource dispersion, likely due to a larger number of atypical samples used to test the classifier performance (see nonparametric multidimensional scaling plot in Fig. S1 in the supplemental material).

View this table:
  • View inline
  • View popup
  • Download powerpoint
TABLE 1

Sensitivity and specificity of the V4V5 and V6 source classifiers

DATA SET S1

Sample metadata (separated Excel sheet). Download Data Set S1, XLSX file, 0.1 MB.
Copyright © 2020 Roguet et al.

This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.

TABLE S1

List of the sequences composing the V4, V4V5, and V6 classifiers. Download Table S1, XLSX file, 0.1 MB.
Copyright © 2020 Roguet et al.

This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.

FIG S1

Nonmetric multidimensional scaling plot indicating the Bacteroidales and Clostridiales assemblage composition relationships between the sewage and animal fecal samples for the V4 and V6 data sets. Download FIG S1, TIF file, 2.2 MB.
Copyright © 2020 Roguet et al.

This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.

FIG S2

Bray-Curtis analysis of the Clostridiales assemblage for the V4 region of the 16S rRNA gene. Download FIG S2, TIF file, 2.2 MB.
Copyright © 2020 Roguet et al.

This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.

FIG S3

Bray-Curtis analysis of the Bacteroidales and Clostridiales assemblage for the V6 region of the 16S rRNA gene. Download FIG S3, TIF file, 2.9 MB.
Copyright © 2020 Roguet et al.

This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.

We only considered classifiers to be validated as global classifiers if the specificity and the sensitivity were ≥70% when using a voting tree decision cutoff of >10%. Reducing the voting cutoff further increases sensitivity in some cases but creates specificity tradeoffs and may generate too many misclassifications when testing environmental samples. Classifiers with a cutoff of ≤10% were considered “draft” in order to offer the users the ability to explore their data set while considering uncertainty. Classifiers with a decision cutoff of ≤5% were discarded.

Scalability of the classifiers.In addition to the host-specific classifiers, we developed diet/physiology-centered classifiers, including those for herbivore and ruminant sources. These broad classifiers allowed us to extend the scope of the host-specific classifiers to capture fecal signatures shared among hosts having similar diets or physiology. This approach also allowed us to include sources with limited sample sets that prohibited creating individual host classifiers (e.g., horse or sheep). The diet/physiology-centered classifiers had a minimum specificity and sensitivity of 78% (Table 2). Interestingly, in some cases, the broader classifiers were able to identify the fecal signature in samples (e.g., Bacteroidales V6 deer samples) where the host-specific classifiers failed (see Data Set S1).

View this table:
  • View inline
  • View popup
  • Download powerpoint
TABLE 2

Sensitivity and specificity of the V4V5 and V6 diet and physiology classifiers

Importance of sequencing depth.We evaluated the performance of the source classifiers (Bacteroidales and Clostridiales for both V4V5 and V6) in seven freshwater river samples for which sewage and cattle fecal contamination had been previously identified (26) using validated quantitative PCR (qPCR) assays (Table 3). Unlike V4V5, V6 classifiers identified the sewage signature in all the samples. The sequencing of the V6 region using the Illumina HiSeq or NextSeq platform provided 4-fold higher depth than the Illumina MiSeq platform used to sequence the V4V5 region. The two V4V5 samples not classified as contaminated with sewage (MKE_18755 and MKE_18765) had the smallest number of sequences matching the classifiers. This increased sequencing depth appeared to provide the information needed to capture the fecal signature in the environmental samples where bacteria are already abundant (in freshwater samples, bacterial abundance is typically >105 cells per ml). For this reason, we decided to focus our classifier approach on short hypervariable regions of the 16S rRNA gene to maximize sequencing depth. In addition to the V6 region, we created and tested the performance of classifiers built on the V4 region, which is targeted extensively in environmental and fecal microbiota data sets (27). We incrementally subsampled all fecal samples used to test the V4 and V6 classifiers in order to define the classifier detection limits. This analysis revealed that a minimum of 1,000 sequences was necessary to properly classify a sample. Additionally, artificial bacterial assemblages composed of several fecal signatures were also correctly classified when the fecal signature matching the classifiers was composed of at least 1,000 sequences represented by ∼100 unique representative sequences (see Table S2 in the supplemental material for individual sample predictions). The relationship between the number of sequences in a sample (proxy of the percentage of contamination) and the voting tree probability was logarithmic.

View this table:
  • View inline
  • View popup
  • Download powerpoint
TABLE 3

Random forest classifications using the V4V5 and V6 global classifiers of seven freshwater river samples contaminated by sewage and cattle fecal pollutiona

TABLE S2

In silico artificial community predictions generated using the V4 and V6 classifiers. Download Table S2, XLSX file, 0.01 MB.
Copyright © 2020 Roguet et al.

This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.

Flexibility of sequence data.We developed and compared V4 classifiers from both de novo-produced V4 sequence data and trimmed V4V5 sequence reads. The ability to reliably use trimmed data was explored using 42 untreated sewage and 191 animal fecal samples sequenced in both the V4 and V4V5 regions. The V4 primer sites were recognized in more than 99% of Bacteroidales and Clostridiales V4V5 reads, suggesting that the V4 primer sites are highly conserved in these two bacterial orders. Since the trimmed V4V5 data set was sequenced to a lower depth, we rarefied the de novo V4 data set to the median count of trimmed V4V5 reads. We created three source classifiers, namely sewage, cattle, and pig, using both rarefied de novo V4 and trimmed V4V5 data sets. The majority of the sequences composing the classifiers were shared between the two data sets (see Table S3 in the supplemental material). Moreover, they generated comparable sensitivity and specificity metrics when tested using their respective data set and the reciprocal data set (i.e., rarified de novo V4 classifiers tested using trimmed samples and vice versa). Interestingly, a nonrarefied de novo V4 classifier had lower sensitivity when using trimmed V4V5 data, likely because of the lower sequence depth of the test data set. The performance of each configuration is described in Table S4 in the supplemental material. Thereafter, V4 classifiers exploiting both rarefied de novo V4 and trimmed V4V5 data sets were implemented in FORENSIC. The performance is described in Table 4.

View this table:
  • View inline
  • View popup
  • Download powerpoint
TABLE 4

Sensitivity and specificity of the combined V4 classifiers generated using de novo V4 and trimmed V4V5 sequences to the V4 primer positions

TABLE S3

List of the sequences composing the rarefied de novo V4 and trimmed V4V5 classifiers. Download Table S3, XLSX file, 0.05 MB.
Copyright © 2020 Roguet et al.

This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.

TABLE S4

Comparison of the performance of the sewage, dog, and pig classifiers generated using de novo V4, rarefied de novo V4, and trimmed V4V5 data sets and tested against de novo V4 and trimmed V4V5 samples. Download Table S4, XLSX file, 0.01 MB.
Copyright © 2020 Roguet et al.

This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.

FORENSIC implementation and user interface.FORENSIC integrates the validated classifiers to perform fecal source identification and can be accessed online at https://forensic.sfs.uwm.edu/. It requires a FASTA file (see Materials and Methods) and yields source contamination predictions in raw downloadable text and interactive visualizations. Classification relies upon an exact match with fecal signature sequences, and therefore trimming to exact V4 or V6 primer positions is crucial. Most FORENSIC analysis takes a few seconds to a few minutes, depending on the number of input sequences. FORENSIC currently encompasses a total of five global source classifiers and two physiology/diet classifiers for V4 and/or V6 regions (Tables 1, 2, and 4).

FORENSIC visualizes its predictions in multiple ways, including a color-coded table of source category predictions for each submitted sample (Fig. 3a) and an interactive association networks of ASVs and sources that allows the user to explore and disentangle the different fecal profiles recovered from the samples (Fig. 3b). This interactive visualization strategy can also help to detect potentially false-positive predictions by discerning the prevailing fecal signature(s) from other trace source signatures and/or from signatures sharing sequences with the dominant one. Users can access raw data outputs such as counts of sequences matching to each classifier, voting tree probabilities, and simplified identifications. Warning flags are displayed if samples do not meet certain criteria, such as the number of sequences in the submitted FASTA or the number of sequences (or unique representative sequences) matching to the classifiers being too low indicating insufficient sequencing depth or, in the case of no matching sequences, technical issues in data processing.

FIG 3
  • Open in new tab
  • Download powerpoint
FIG 3

Forensic interactive report, including (a) a source identification predictions table and (b) bubble plot to characterize the fecal signature in the submitted samples. (a) Colors indicate if the fecal source signature was characterized (red) or not (white or green) in the tested sample. Dark green sources indicate that at the voting tree probability observed, the specificity of the classifier was at least 80%; light green sources indicate a specificity of <80%. Global classifiers are represented by black rings. Discarded classifiers are symbolized using black circles. (b) Bubbles represent all the amplicon sequence variants (ASVs) that compose the classifiers for a given bacterial group. Blue bubbles show the ASVs recovered from the sample tested, unlike red bubbles. The sizes of the bubbles are proportional to the relative abundances of the ASV in the tested sample. Sources are represented by the outer arcs. The longer the arc is, the more that source contributed to the fecal pollution (among the sources investigated). The example shows the V4 Clostridiales profile recovered from a Chinese surface-water pond (SRA accession number SRR6037827) and classified as sewage by random forest. A total of 74 out of the 385 ASVs that compose all the classifiers were found in the sample; 55 and 14 ASVs were associated with the sewage and dog classifiers, respectively. The largest arc (representing 91% of the relative abundance of all the fecally associated ASVs) was associated with the sewage signature (bottom right). The second largest was associated with the dog and the third with the pig signature. The interactive version of this figure is available at https://forensic.sfs.uwm.edu/result/example. Data from Hägglund et al. and Hu et al. (16, 63). See Table S5 in the supplemental material for the full list of the individual predictions.

TABLE S5

Fecal predictions using the global classifiers of the sewage, animal fecal, and environmental samples from Hägglund et al. (16) and Hu et al. (63). Download Table S5, XLSX file, 0.03 MB.
Copyright © 2020 Roguet et al.

This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.

DISCUSSION

FORENSIC is an online platform for performing fecal source identification in water from 16S rRNA gene amplicons. It leverages Bacteroidales and Clostridiales, two bacterial orders with many members that show remarkable host specificity in mammalian guts (15, 23, 28, 29). Microbial community differences that distinguish animals are substantially diet driven due to the specialization of bacterial members for specific metabolic functions (30–32). Here, we used these host-microbiota association patterns to discriminate fecal contamination with both host-based (e.g., human, cattle, pig) and/or diet/physiology-based (e.g., ruminant, herbivore) classifiers. Specific markers for both Bacteroides (33, 34) and Lachnospiraceae (35) have shown a broad geographic distribution. Consistent with these findings, we observed minimal geographic sample partitioning for the whole Bacteroidales and Clostridiales assemblages, which demonstrates that these classifiers could have global relevance.

The FORENSIC sewage classifiers had a sensitivity of 100%, meaning that all tested sewage samples were correctly classified. These samples were collected from three continents, which confirms that sewage carries a robust human fecal signature (36, 37) and that the classifiers are potentially applicable to cities worldwide with modern sewer infrastructure. The sewage signature also is unique and differentiated from the other nonintegrative (individual animal) sources, as no animal fecal samples were identified as sewage and no sewage samples were classified as an animal source.

Given that the microbial community composition of a single fecal sample can be quite distinct from that of an integrative sample (36), we wanted to test whether this approach could correctly identify sources from samples encompassing the variability expected among individual sources. By creating classifiers using only the 15 to 20 most similar samples and then testing their performance using the most dissimilar samples, we were able to examine this scenario. The random forest approach generated high sensitivity and specificity for most fecal pollution sources we tested, demonstrating the consistency and cohesion of the host fecal signature as characterized by a minimum number of samples. Misclassifications generally involved sources with similar diets or samples with atypical fecal microbiota. Given that fecal pollution in waterways is generally derived from multiple animals, we anticipate that the application of FORENSIC for actual water sample testing would produce few false-negative results and that sporadic individual atypical animals would not influence the results.

For two animal sources (cattle and dog), the community composition data split into two very distinct community types, possibly driven by the diet of the animal or other factors such as cohabitation, age, or individual variation. For instance, a large dissimilarity in the cattle microbial community composition has been observed between cattle that are grass fed versus those that are grain fed (38, 39). Therefore, specific animal groups, such as cattle, that are fed distinct diets or have other distinct lifestyle characteristics may not be suitable for a single-marker-gene or classifier approach for tracking their impact. Instead, multiple classifiers used in combination may be needed to cover the diversity of a single animal group.

Global classifiers implemented in FORENSIC were characterized from a minimum of 15 source samples and had a specificity and sensitivity of 70% and a decision cutoff of >10%. Classifiers that did not meet these requirements were either discarded or designated draft classifiers. Cats and dogs had microbial communities similar to that of sewage, indicating shared sequences (Fig. S1). This may be due to shared diet or to actual pet or wildlife inputs into the sewer conveyance system. We did not attempt to make a cat classifier, and two of the four dog classifiers were either discarded or designated draft.

FORENSIC is implemented with random forest classification, which is particularly suitable for an online application, especially since the classifiers do not require each new investigation to reanalyze de novo bacterial assemblages in all trained and tested samples. The global classifiers of FORENSIC are based on the shared/common source signatures, increasing access to sequence-based analysis for users who do not have source samples. SourceTracker (21), a well-established tool, relies upon direct comparison of a source community to the sink community and requires both communities to be used as inputs. Users would need to provide representative samples, or chose samples from publicly available data. Analysis is performed de novo, i.e., without a reference database of what to expect, and therefore does not require that a source contain a universal signature. SourceTracker is highly sensitive, particularly for atypical sources or uncharacterized sources, when the source community is available. Furthermore, since the entire community is used for comparison, SourceTracker may be more suitable to identify animals where additional taxonomic groups provide a more distinctive signature than Bacteroidales or Clostridiales. Closely related animals (cattle and horses), unique geographic signatures, or the presence of uncharacterized sources, particularly wildlife, should be considered when choosing a tool.

The scalability of FORENSIC makes it possible to add more samples to the reference database as they become available. This characteristic will allow for systematic verification of the classifier performance, with a goal of continuous improvement. Overall, future directions include (i) extending the source classifiers to include farm animals or waterfowl and targeting predominant bacterial groups appropriate to these sources, (ii) strengthening broad diet/physiology classifiers, or, inversely, (iii) characterizing finer subsource clusters to improve the sensitivity of the classifiers.

Our study also highlights that sequencing depth is crucial to capturing fecal signatures within environmental samples. Even at relatively significant contamination levels, the fecal microorganisms can be outnumbered by the native microorganisms by >2 orders of magnitude. We recommend implementing our approach on deeply sequenced data (i.e., >100,000 reads) and using sequencing platforms that generate shorter reads (e.g., V4 or V6 regions of the 16S rRNA gene) and therefore greater depth. However, FORENSIC correctly identified the source of fecal pollution in environmental samples with lower sequencing depth where the fecal signature was represented with a minimum of 1,000 sequences and 100 unique representative sequences. This rate of correct identification equates to correctly identifying the fecal signature in 100,000 reads if the signature comprises 1% of the community.

When generating the classifier, the short V4 reads performed as well as the longer V4V5 reads, which supports previous reports that minimal information was needed for resolution of microbial members (40, 41). Although primers for the region targeted are often cited for introducing bias in the assessment of microbial communities (42, 43), we demonstrated the equivalence of using V4V5 sequences trimmed to the Earth Microbiome Project V4 primer positions and of using de novo V4 sequences for the specific FORENSIC application. This result is congruent with previous studies that reported a limited bias for several phyla, including Bacteroidetes and Firmicutes (44, 45), likely because primer sites are conserved within closely related groups.

In summary, FORENSIC offers a comprehensive open-source platform that enables rapid and straightforward screening of large sequence data sets to identify fecal contamination. In the era of affordable and accessible high-throughput sequencing, FORENSIC is a valuable tool for community-based water quality assessments to prospect the main sources of fecal pollution and guide appropriate management actions.

MATERIALS AND METHODS

Sample collection and processing.Animal stool samples were collected in the United States, France, the Dominican Republic, and Australia. Samples collected in the United States were preserved in lysis buffer from the stool extraction kit and shipped to the laboratory on ice and stored at −80°C upon arrival. Dominican Republic samples were stored at −20°C until extracted. Australian fecal sample processing and storage were described previously, and extracted DNA was provided (46). Animal stool samples from France were freeze-dried prior to DNA extraction. A human fecal signature was characterized from sewage influent samples from 71 cities across the United States (36), Reus in Spain (47), and Sydney in Australia. Finally, seven river samples collected after heavy rainfall in the Milwaukee River Basin, Milwaukee, WI, USA, were used to test the sensitivity of the classifiers (26). Sewage and animal fecal sample details are reported in Data Set S1.

Sample processing and DNA extraction.For the U.S. and Dominican Republic fecal samples, bacterial DNA was extracted from approximately 1 g of material using a QIAmp DNA stool minikit according to the manufacturer’s instructions (Qiagen, Valencia, CA) as described (25). About 200 mg of Australian stool samples was extracted using the QIAmp stool DNA kit, while about 500 mg of dry fecal sample for French samples was extracted using the FastDNA spin kit for soil (MP Biomedicals, Carlsbad, CA). Australian and French extracted DNA was freeze-dried before being shipped to the United States.

In total, 25 ml for sewage influent and between 200 and 400 ml for freshwater samples was filtered onto 0.22-μm mixed cellulose ester filters with a 47-mm diameter (Millipore, USA). DNA from filters was then extracted using the FastDNA spin kit for soil (MP Biomedicals, Solon, OH) as previously described (26, 36). DNA concentration was assessed on a Qubit 2.0 fluorometer (Invitrogen, Carlsbad, CA).

16S rRNA gene sequencing and library construction.Amplicon libraries were constructed at the Josephine Bay Paul Center in the Marine Biological Laboratory (Woods Hole, MA) and/or the Great Lake Genomics Center (Milwaukee, WI) using the MiSeq Illumina platform for the V4 and V4V5 hypervariable regions and the HiSeq and/or NextSeq Illumina platforms for V6 hypervariable regions. Details for amplicon library construction and sequencing procedures for the V4V5 and V6 regions were followed as previously described (48, 49). The construction of the V4 amplicon library using the Earth Microbiome primers is described in Text S1 in the supplemental material.

TEXT S1

V4 amplification and sequencing. Download Text S1, DOCX file, 0.02 MB.
Copyright © 2020 Roguet et al.

This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.

Reads were trimmed using cutadapt v1.14 (50), allowing for two to four mismatches in the primer sequence depending on the region. Forward and reverse reads were merged using PEAR v0.9.10 (51) with the default parameters. Merged reads with a length shorter or longer than 5% of the V4V5 median (372 bp) or 10% of the V4 and V6 median (i.e., 253 and 60 bp, respectively) were removed during this step. Quality filtering was assessed using mothur 1.39.5 (52). Assembled reads containing ambiguous bases or with more than eight successive homopolymers were discarded. Sequences were taxonomically assigned based on the best match in a Global Alignment for Sequence Taxonomy (GAST) (53) process using the Silva 132 database (54). Only sequences assigned to Bacteroidales and Clostridiales were selected for further analysis.

Minimum entropy decomposition analysis.Minimum entropy decomposition (MED) analysis was performed using the oligotyping pipeline version 2.1 (55). MED uses nucleotide entropy (nucleotide variant variability) along DNA sequences to distinguish differences in nucleotides originating as a result of true genetic variation among organisms from noise due to sequencing errors (56). MED partitions DNA sequences into amplicon sequence variants (ASVs) according to the position in the DNA sequence with the highest entropy. This step iteratively lasts until each final ASV satisfies the maximum entropy criterion. ASVs that do not meet the minimum substantive abundance (M) criterion were discarded. M was set to N/30,000, N/40,000, and N/60,000, for the V4V5, V4, and V6 data sets, respectively, where N was the total number of sequences in the data set.

Selection of the samples to train and test the classifiers.Sequencing quality was systematically evaluated for each sewage and animal fecal sample before creating or testing the classifiers. For that, a prospective MED analysis was performed for both Bacteroidales and Clostridiales using all the samples of a given source. Samples in which more than 50% of the reads were discarded during the MED analysis for at least one of the bacterial groups were deleted from further analysis. The total numbers of amplicon sequence variants (ASVs) were log10 transformed and plotted in a box plot. Lower outlier samples, i.e., those lower than 1.5 times the interquartile, were considered to have low sequencing depth and were discarded. We also verified for each source that there was no sample with an extremely atypical bacterial assemblage using Bray-Curtis dissimilarities computed separately for each source and bacterial group via the vegan package (57) implemented in R. Intrasource dissimilarities were averaged per sample and plotted in a box plot; upper outliers were not used to evaluate the performance of the classifier but are reported separately in Data Set S1. These extreme outliers comprised less than 8% of the total samples and, in some cases, had Bray-Curtis dissimilarity scores of >0.95 compared to other samples from a given host, which could have resulted from sample handling, external contamination, or poor sequencing quality.

Each classifier was built using 15 to 20 samples having the most similar intrasource bacterial assemblages (based on Bray-Curtis dissimilarity index). The validation of the classifier predictions was assessed with a minimum of six samples (30% minimum of the samples per source) having the most dissimilar intrasource bacterial assemblages to ensure a stringent assessment of performance. We systematically explored the intrasource variability of the bacterial group assemblages using a hierarchical clustering analysis (based on the Bray-Curtis dissimilarity and Ward linkage) in order to take into account environmental factors that may result in bifurcation of gut microbiota structure within sources. We defined subgroups within sources when the branch length between two clusters for a given source was greater than five. A relevant shift was observed in the V6 cattle (Bacteroidales and Clostridiales) and V6 dog (Bacteroidales) bacterial assemblages. A balanced number of samples from each cluster was used to build the V6 cattle and dog (Bacteroidales) classifiers. Diet and physiology classifiers were trained and tested using the collection of samples for each respective category (e.g., ruminant trained and tested using cattle and deer samples).

Random forest classifications.To characterize the fecal source signature for a classifier, bacterial assemblages from a given source (e.g., sewage) were compared to the assemblages from all other source samples (e.g., nonsewage samples from dog, cattle, etc.). For that, 100 random forests consisting of 10,000 trees were computed per source and per bacterial group using the default settings of the ‘randomForest’ function implemented in the randomForest R package (58). Mean decreases in Gini were averaged for each ASV among the 100 random forest replicates. Mean decreases in Gini were plotted in a scree plot. ASVs with mean decreases in Gini above the breakpoint curve were chosen to be part of the classifier. Breakpoints were estimated using the “breakpoints” function included in the strucchange R package (59). Classifiers were trained with 100 forests constituted of 1,000 trees to recognize the fecal source signatures from the ASVs selected upstream (per source and bacterial group, the total relative abundance of the ASVs was of 100%). Replicates were pooled using the “combine” function. One classifier was created for each source, bacterial group, and region of the 16S rRNA gene. To identify the fecal source signature in a test sample, sequences matching the ones in the classifiers were first extracted. Bacterial profiles were then compared to the ones stored in the classifiers using the “predict” function, which generated a voting tree probability, translating the degree of similarity between the test and the classifier profiles. We defined the voting tree probability cutoff (called the decision cutoff) to maximize the sensitivity and specificity with a minimum of 70% for each classifier. Samples associated with a voting tree higher than the decision cutoff for a given source were considered to be contaminated by that source. In addition to the source-specific classifiers (e.g., sewage, cattle), we created diet/physiology classifiers from multiple fecal source samples: V4/V4V5 ruminant classifiers were created using cattle and deer fecal samples, and a V4 herbivore classifier was created using cattle, deer, and horse fecal samples. Although V6 herbivore classifiers were created using ruminant fecal samples (i.e., cattle and deer), their specificity to only detect the ruminant fecal signature was low compared to that for detecting the herbivore signature (data not shown). Thus, these classifiers were considered herbivore classifiers. All analyses were conducted using the statistical environment R (version 3.6.0) (60).

Sensitivity, specificity, and library size detection limit of random forest classifications.Sensitivity was calculated as the number of true positives (TPs) (i.e., when a source was correctly detected) divided by the sum of TPs and false negatives (FNs) (i.e., when a source was not detected as expected). In contrast, specificity was defined as the number of true negatives (TNs) (i.e., when a source did not have a classifier and thus should have been classified as “unknown”) divided by the sum of TNs and false positives (FPs) (i.e., when an unexpected source was detected). Classifier performance for specificity was evaluated using the test samples from host groups that were the focus of the classifier, but with also seven additional animal fecal sources; see Data Set S1 for details. The minimum number of sequences at which a classifier can correctly classify a fecal sample was assessed using an in silico analysis. All V4 and V6 test samples were subsampled at 90, 80, 70, 60, 50, 40, 30, 20, 10, 5, 1, 0.5, 0.1, 0.05, 0.01, 0.005, and 0.001%. The subsampling was performed using the function “sub.sample” in mothur 1.39.5. Additionally, the minimum number of sequences belonging to the classifier signatures at which a classifier can correctly identify multiple fecal sources was assessed using artificial bacterial assemblages generated in silico. Defined sequence proportions from pristine surface freshwater and fecal samples (i.e., sewage, cattle, pig, and sheep) not used to train the classifiers were combined to create V4 and V6 artificial bacterial community mixes (see Table S2 in the supplemental material). Sequences from each sample were selected by randomly subsampling (99 repeats) using the “rrarefy” function included in the vegan R package.

Implementation of the online interface.The FORENSIC Web interface (https://forensic.sfs.uwm.edu/) was designed using free and open software development practices in mind. The source code of FORENSIC was implemented in Python programming language using the Flask Web Framework (https://palletsprojects.com/p/flask/) for easy maintenance. FORENSIC can run as a Web application behind a reverse proxy (such as NGinx) and on Web server (such as Apache; http://apache.org/), and can be deployed on any Unix-based operating system. The reverse proxy handles Hypertext Transfer Protocol Secure (HTTPS) encryption for security, sanitization of malformed inputs, and efficient data upload. FORENSIC requires the upload of a multi-FASTA file and assumes the sequencing reads to have been trimmed and merged. The Python Web application detects the 16S rRNA region targeted in sequences and generates a count table of the ASVs matching the sequences in the classifiers. After the classification, the Web platform generates a static report using the data generated. The interactive report page visualizations are created using D3.js (https://d3js.org/) and our ad hoc JavaScript code. The data used for visualizations can also be directly downloaded from the interface as tab-separated data files for user-directed analyses. The web platform also notifies the user via e-mail when analysis is complete if an e-mail address has been provided during submission. A unique identifier (ID) is also provided during the submission to review the results for up to 1 year.

Data availability.The source code of FORENSIC is available online (https://github.com/mclellanlab/web_forensic) and licensed with the General Public License to allow local and institutional deployments. The R scripts to create, train, and test the classifiers are available on figshare, as well as the script to determine the decision cutoff (https://doi.org/10.6084/m9.figshare.11231627). Raw read files were uploaded to the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) and are accessible under the BioProject accession numbers PRJNA261344, PRJNA264400, PRJNA433408, PRJNA591968, and PRJNA591970 for V4V5; PRJNA591976 and PRJNA591975 for V4; and PRJNA235337, PRJNA433407, PRJNA591949, and PRJNA505345 for V6. Data Set S1 contains the list of samples and their corresponding SRA study accession numbers.

ACKNOWLEDGMENTS

We thank Warish Ahmed and Lucas Françoise for providing us with animal fecal sample DNA. We also thank Mahdi Belcaid and Jill S. McClary for their insightful discussions.

This project was funded under National Institutes of Health (NIH) grant R01 AI091829 to S.L.M.

FOOTNOTES

    • Received December 12, 2019.
    • Accepted February 21, 2020.
  • Copyright © 2020 Roguet et al.

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license.

REFERENCES

  1. 1.↵
    World Health Organization. 2011. Guidelines for drinking-water quality, 4th ed. World Health Organization, Geneva, Switzerland.
  2. 2.↵
    1. Santo Domingo JW,
    2. Ashbolt NJ
    . 2008. Fecal pollution of water, 1–12. In Cleveland CJ (ed), Encyclopedia of Earth. National Council for Science and the Environment, Washington, DC.
  3. 3.↵
    1. DeFlorio-Barker S,
    2. Wing C,
    3. Jones RM,
    4. Dorevitch S
    . 2018. Estimate of incidence and cost of recreational waterborne illness on United States surface waters. Environ Heal A Glob Access Sci Source 17:3.
    OpenUrl
  4. 4.↵
    1. Soller JA,
    2. Schoen ME,
    3. Bartrand T,
    4. Ravenscroft JE,
    5. Ashbolt NJ
    . 2010. Estimated human health risks from exposure to recreational waters impacted by human and non-human sources of faecal contamination. Water Res 44:4674–4691. doi:10.1016/j.watres.2010.06.049.
    OpenUrlCrossRefPubMed
  5. 5.↵
    1. Schoen ME,
    2. Ashbolt NJ
    . 2010. Assessing pathogen risk to swimmers at non-sewage impacted recreational beaches. Environ Sci Technol 44:2286–2291. doi:10.1021/es903523q.
    OpenUrlCrossRefPubMed
  6. 6.↵
    1. Willey JM,
    2. Sherwood LM,
    3. Woolverton CJ
    . 2017. Proteobacteria, p 504–534. In Prescott’s microbiology, 10th ed. McGraw-Hill, New York, NY.
  7. 7.↵
    1. Harwood VJ,
    2. Staley C,
    3. Badgley BD,
    4. Borges K,
    5. Korajkic A
    . 2014. Microbial source tracking markers for detection of fecal contamination in environmental waters: relationships between pathogens and human health outcomes. FEMS Microbiol Rev 38:1–40. doi:10.1111/1574-6976.12031.
    OpenUrlCrossRefPubMed
  8. 8.↵
    1. Unno T,
    2. Staley C,
    3. Brown CM,
    4. Han D,
    5. Sadowsky MJ,
    6. Hur H
    . 2018. Fecal pollution: new trends and challenges in microbial source tracking using next-generation sequencing. Environ Microbio 20:3132–3140. doi:10.1111/1462-2920.14281.
    OpenUrlCrossRef
  9. 9.↵
    1. Vierheilig J,
    2. Savio D,
    3. Farnleitner AH,
    4. Reischer GH,
    5. Ley RE,
    6. Mach RL,
    7. Farnleitner AH,
    8. Reischer GH
    . 2015. Potential applications of next generation DNA sequencing of 16S rRNA gene amplicons in microbial water quality monitoring. Water Sci Technol 72:1962–1972. doi:10.2166/wst.2015.407.
    OpenUrlAbstract/FREE Full Text
  10. 10.↵
    1. Henry R,
    2. Schang C,
    3. Coutts S,
    4. Kolotelo P,
    5. Prosser T,
    6. Crosbie N,
    7. Grant T,
    8. Cottam D,
    9. O’Brien P,
    10. Deletic A,
    11. McCarthy D
    . 2016. Into the deep: evaluation of SourceTracker for assessment of faecal contamination of coastal waters. Water Res 93:242–253. doi:10.1016/j.watres.2016.02.029.
    OpenUrlCrossRef
  11. 11.↵
    1. Brown CM,
    2. Staley C,
    3. Wang P,
    4. Dalzell B,
    5. Chun CL,
    6. Sadowsky MJ
    . 2017. A high-throughput DNA sequencing approach to determine sources of fecal bacteria in a Lake Superior estuary. Environ Sci Technol 51:8263–8271. doi:10.1021/acs.est.7b01353.
    OpenUrlCrossRef
  12. 12.↵
    1. Comte J,
    2. Berga M,
    3. Severin I,
    4. Logue JB,
    5. Lindström ES
    . 2017. Contribution of different bacterial dispersal sources to lakes: population and community effets in different seasons. Environ Microbiol 19:2391–2404. doi:10.1111/1462-2920.13749.
    OpenUrlCrossRef
  13. 13.↵
    1. McCarthy D,
    2. Jovanovic D,
    3. Lintern A,
    4. Teakle I,
    5. Barnes M,
    6. Deletic A,
    7. Coleman R,
    8. Rooney G,
    9. Prosser T,
    10. Coutts S,
    11. Hipsey MR,
    12. Bruce LC,
    13. Henry R
    . 2017. Source tracking using microbial community fingerprints: method comparison with hydrodynamic modelling. Water Res 109:253–265. doi:10.1016/j.watres.2016.11.043.
    OpenUrlCrossRef
  14. 14.↵
    1. Staley C,
    2. Kaiser T,
    3. Lobos A,
    4. Ahmed W,
    5. Harwood VJ,
    6. Brown CM,
    7. Sadowsky MJ
    . 2018. Application of SourceTracker for accurate identification of fecal pollution in recreational freshwater: a double-blinded study. Environ Sci Technol 52:4207–4217. doi:10.1021/acs.est.7b05401.
    OpenUrlCrossRef
  15. 15.↵
    1. Roguet A,
    2. Eren AM,
    3. Newton RJ,
    4. McLellan SL
    . 2018. Fecal source identification using random forest. Microbiome 6:185. doi:10.1186/s40168-018-0568-3.
    OpenUrlCrossRefPubMed
  16. 16.↵
    1. Hägglund M,
    2. Bäckman S,
    3. Macellaro A,
    4. Lindgren P,
    5. Borgmästars E,
    6. Jacobsson K,
    7. Dryselius R,
    8. Stenberg P,
    9. Sjödin A,
    10. Forsman M,
    11. Ahlinder J
    . 2018. Accounting for bacterial overlap between raw water communities and contaminating sources improves the accuracy of signature-based microbial source tracking. Front Microbiol 9:2364. doi:10.3389/fmicb.2018.02364.
    OpenUrlCrossRef
  17. 17.↵
    1. Ley RE,
    2. Hamady M,
    3. Lozupone C,
    4. Turnbaugh PJ,
    5. Ramey RR,
    6. Bircher JS,
    7. Schlegel ML,
    8. Tucker TA,
    9. Schrenzel MD,
    10. Knight R,
    11. Gordon JI
    . 2008. Evolution of mammals and their gut microbes. Science 320:1647–1652. doi:10.1126/science.1155725.
    OpenUrlAbstract/FREE Full Text
  18. 18.↵
    1. Kostic AD,
    2. Howitt MR,
    3. Garrett WS
    . 2013. Exploring host-microbiota interactions in animal models and humans. Genes Dev 27:701–718. doi:10.1101/gad.212522.112.
    OpenUrlAbstract/FREE Full Text
  19. 19.↵
    1. Yoon SS,
    2. Kim EK,
    3. Lee WJ
    . 2015. Functional genomic and metagenomic approaches to understanding gut microbiota-animal mutualism. Curr Opin Microbiol 24:38–46. doi:10.1016/j.mib.2015.01.007.
    OpenUrlCrossRefPubMed
  20. 20.↵
    1. Nishida AH,
    2. Ochman H
    . 2018. Rates of gut microbiome divergence in mammals. Mol Ecol 27:1884–1897. doi:10.1111/mec.14473.
    OpenUrlCrossRef
  21. 21.↵
    1. Knights D,
    2. Kuczynski J,
    3. Charlson ES,
    4. Zaneveld J,
    5. Mozer MC,
    6. Collman RG,
    7. Bushman FD,
    8. Knight R,
    9. Kelley ST
    . 2011. Bayesian community-wide culture-independent microbial source tracking. Nat Methods 8:761–763. doi:10.1038/nmeth.1650.
    OpenUrlCrossRefPubMedWeb of Science
  22. 22.↵
    1. Dethlefsen L,
    2. McFall-Ngai MJ,
    3. Relman DA
    . 2007. An ecological and evolutionary perspective on human-microbe mutualism and disease. Nature 449:811–818. doi:10.1038/nature06245.
    OpenUrlCrossRefPubMedWeb of Science
  23. 23.↵
    1. Fisher JC,
    2. Eren AM,
    3. Green HC,
    4. Shanks OC,
    5. Morrison HG,
    6. Vineis JH,
    7. Sogin ML,
    8. McLellan SL
    . 2015. Comparison of sewage and animal fecal microbiomes by using oligotyping reveals potential human fecal indicators in multiple taxonomic groups. Appl Environ Microbiol 81:7023–7033. doi:10.1128/AEM.01524-15.
    OpenUrlAbstract/FREE Full Text
  24. 24.↵
    1. Eren AM,
    2. Sogin ML,
    3. Morrison HG,
    4. Vineis JH,
    5. Fisher JC,
    6. Newton RJ,
    7. McLellan SL
    . 2015. A single genus in the gut microbiome reflects host preference and specificity. ISME J 9:90–100. doi:10.1038/ismej.2014.97.
    OpenUrlCrossRefPubMed
  25. 25.↵
    1. Feng S,
    2. Ahmed W,
    3. McLellan SL
    . 2020. Ecological and technical mechanisms for cross reaction of human fecal indicators with animal hosts. Appl Environ Microbiol 86:e02319-19. doi:10.1128/AEM.02319-19.
    OpenUrlAbstract/FREE Full Text
  26. 26.↵
    1. Olds HT,
    2. Corsi SR,
    3. Dila DK,
    4. Halmo KM,
    5. Bootsma MJ,
    6. McLellan SL
    . 2018. High levels of sewage contamination released from urban areas after storm events: a quantitative survey with sewage specific bacterial indicators. PLoS Med 15:e1002614. doi:10.1371/journal.pmed.1002614.
    OpenUrlCrossRef
  27. 27.↵
    1. Thompson LR,
    2. Sanders JG,
    3. McDonald D,
    4. Amir A,
    5. Ladau J,
    6. Locey KJ,
    7. Prill RJ,
    8. Tripathi A,
    9. Gibbons SM,
    10. Ackermann G,
    11. Navas-Molina JA,
    12. Janssen S,
    13. Kopylova E,
    14. Vázquez-Baeza Y,
    15. González A,
    16. Morton JT,
    17. Mirarab S,
    18. Zech Xu Z,
    19. Jiang L,
    20. Haroon MF,
    21. Kanbar J,
    22. Zhu Q,
    23. Jin Song S,
    24. Kosciolek T,
    25. Bokulich NA,
    26. Lefler J,
    27. Brislawn CJ,
    28. Humphrey G,
    29. Owens SM,
    30. Hampton-Marcell J,
    31. Berg-Lyons D,
    32. McKenzie V,
    33. Fierer N,
    34. Fuhrman JA,
    35. Clauset A,
    36. Stevens RL,
    37. Shade A,
    38. Pollard KS,
    39. Goodwin KD,
    40. Jansson JK,
    41. Gilbert JA,
    42. Knight R
    , Consortium TEMP. 2017. A communal catalogue reveals Earth’s multiscale microbial diversity. Nature 551:457–463. doi:10.1038/nature24621.
    OpenUrlCrossRefPubMed
  28. 28.↵
    1. McLellan SL,
    2. Eren AM
    . 2014. Discovering new indicators of fecal pollution. Trends Microbiol 22:697–706. doi:10.1016/j.tim.2014.08.002.
    OpenUrlCrossRefPubMed
  29. 29.↵
    1. Ley RE,
    2. Lozupone CA,
    3. Hamady M,
    4. Knight R,
    5. Gordon JI
    . 2008. Worlds within worlds: evolution of the vertebrate gut microbiota. Nat Rev Microbiol 6:776–788. doi:10.1038/nrmicro1978.
    OpenUrlCrossRefPubMedWeb of Science
  30. 30.↵
    1. Zou Y,
    2. Xue W,
    3. Luo G,
    4. Deng Z,
    5. Qin P,
    6. Guo R,
    7. Sun H,
    8. Xia Y,
    9. Liang S,
    10. Dai Y,
    11. Wan D,
    12. Jiang R,
    13. Su L,
    14. Feng Q,
    15. Jie Z,
    16. Guo T,
    17. Xia Z,
    18. Liu C,
    19. Yu J,
    20. Lin Y,
    21. Tang S,
    22. Huo G,
    23. Xu X,
    24. Hou Y,
    25. Liu X,
    26. Wang J,
    27. Yang H,
    28. Kristiansen K,
    29. Li J,
    30. Jia H,
    31. Xiao L
    . 2019. 1,520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses. Nat Biotechnol 37:179–185. doi:10.1038/s41587-018-0008-8.
    OpenUrlCrossRef
  31. 31.↵
    1. La Reau AJ,
    2. Meier-Kolthoff JP,
    3. Suen G
    . 2016. Sequence-based analysis of the genus Ruminococcus resolves its phylogeny and reveals strong host association. Microb Genom 2:e000099. doi:10.1099/mgen.0.000099.
    OpenUrlCrossRef
  32. 32.↵
    1. Krishnan S,
    2. Alden N,
    3. Lee K
    . 2015. Pathways and functions of gut microbiota metabolism impacting host physiology. Curr Opin Biotechnol 36:137–145. doi:10.1016/j.copbio.2015.08.015.
    OpenUrlCrossRef
  33. 33.↵
    1. Reischer GH,
    2. Kasper DC,
    3. Steinborn R,
    4. Farnleitner AH,
    5. Mach RL
    . 2007. A quantitative real-time PCR assay for the highly sensitive and specific detection of human faecal influence in spring water from a large alpine catchment area. Lett Appl Microbiol 44:351–356. doi:10.1111/j.1472-765X.2006.02094.x.
    OpenUrlCrossRefPubMedWeb of Science
  34. 34.↵
    1. Feng S,
    2. McLellan SL
    . 2019. Highly specific sewage-derived Bacteroides quantitative PCR assays target sewage-polluted waters. Appl Environ Microbiol 85:e02696-18. doi:10.1128/AEM.02696-18.
    OpenUrlAbstract/FREE Full Text
  35. 35.↵
    1. Feng S,
    2. Bootsma M,
    3. Mclellan SL
    . 2018. Novel human-associated Lachnospiraceae genetic markers improve detection of fecal pollution sources in urban waters. Appl Environ Microbiol 84:e00309-18. doi:10.1128/AEM.00309-18.
    OpenUrlAbstract/FREE Full Text
  36. 36.↵
    1. Newton RJ,
    2. McLellan SL,
    3. Dila DK,
    4. Vineis JH,
    5. Morrison HG,
    6. Murat Eren A,
    7. Sogin ML
    . 2015. Sewage reflects the microbiomes of human populations. mBio 6:e02574–14. doi:10.1128/mBio.02574-14.
    OpenUrlCrossRef
  37. 37.↵
    1. McLellan SL,
    2. Newton RJ,
    3. Vandewalle JL,
    4. Shanks OC,
    5. Huse SM,
    6. Eren AM,
    7. Sogin ML
    . 2013. Sewage reflects the distribution of human faecal Lachnospiraceae. Environ Microbiol 15:2213–2227. doi:10.1111/1462-2920.12092.
    OpenUrlCrossRef
  38. 38.↵
    1. Shanks OC,
    2. Kelty CA,
    3. Archibeque S,
    4. Jenkins M,
    5. Newton RJ,
    6. McLellan SL,
    7. Huse SM,
    8. Sogin ML
    . 2011. Community structures of fecal bacteria in cattle from different animal feeding operations. Appl Environ Microbiol 77:2992–3001. doi:10.1128/AEM.02988-10.
    OpenUrlAbstract/FREE Full Text
  39. 39.↵
    1. Hagey JV,
    2. Bhatnagar S,
    3. Heguy JM,
    4. Karle BM,
    5. Price PL,
    6. Meyer D,
    7. Maga EA
    . 2019. Fecal microbial communities in a large representative cohort of California dairy cows. Front Microbiol 10:1093. doi:10.3389/fmicb.2019.01093.
    OpenUrlCrossRef
  40. 40.↵
    1. Jeraldo P,
    2. Chia N,
    3. Goldenfeld N
    . 2011. On the suitability of short reads of 16S rRNA for phylogeny-based analyses in environmental surveys. Environ Microbiol 13:3000–3009. doi:10.1111/j.1462-2920.2011.02577.x.
    OpenUrlCrossRefPubMedWeb of Science
  41. 41.↵
    1. Liu Z,
    2. Lozupone C,
    3. Hamady M,
    4. Bushman FD,
    5. Knight R
    . 2007. Short pyrosequencing reads suffice for accurate microbial community analysis. Nucleic Acids Res 35:e120. doi:10.1093/nar/gkm541.
    OpenUrlCrossRefPubMed
  42. 42.↵
    1. Tremblay J,
    2. Singh K,
    3. Fern A,
    4. Kirton ES,
    5. He S,
    6. Woyke T,
    7. Lee J,
    8. Chen F,
    9. Dangl JL,
    10. Tringe SG
    . 2015. Primer and platform effects on 16S rRNA tag sequencing. Front Microbiol 6:771. doi:10.3389/fmicb.2015.00771.
    OpenUrlCrossRefPubMed
  43. 43.↵
    1. Rintala A,
    2. Pietilä S,
    3. Munukka E,
    4. Eerola E,
    5. Pursiheimo JP,
    6. Laiho A,
    7. Pekkala S,
    8. Huovinen P
    . 2017. Gut microbiota analysis results are highly dependent on the 16s rRNA gene target region, whereas the impact of DNA extraction is minor. J Biomol Tech 28:19–30. doi:10.7171/jbt.17-2801-003.
    OpenUrlCrossRef
  44. 44.↵
    1. Mao DP,
    2. Zhou Q,
    3. Chen CY,
    4. Quan ZX
    . 2012. Coverage evaluation of universal bacterial primers using the metagenomic datasets. BMC Microbiol 12:66. doi:10.1186/1471-2180-12-66.
    OpenUrlCrossRefPubMed
  45. 45.↵
    1. Ghyselinck J,
    2. Pfeiffer S,
    3. Heylen K,
    4. Sessitsch A,
    5. De Vos P
    . 2013. The effect of primer choice and short read sequences on the outcome of 16S rRNA gene based diversity studies. PLoS One 8:e71360. doi:10.1371/journal.pone.0071360.
    OpenUrlCrossRefPubMed
  46. 46.↵
    1. Ahmed W,
    2. Goonetilleke A,
    3. Powell D,
    4. Chauhan K,
    5. Gardner T
    . 2009. Comparison of molecular markers to detect fresh sewage in environmental waters. Water Res 43:4908–4917. doi:10.1016/j.watres.2009.09.047.
    OpenUrlCrossRefPubMed
  47. 47.↵
    1. Fisher JC,
    2. Levican A,
    3. Figueras MJ,
    4. McLellan SL
    . 2014. Population dynamics and ecology of Arcobacter in sewage. Front Microbiol 5:525. doi:10.3389/fmicb.2014.00525.
    OpenUrlCrossRefPubMed
  48. 48.↵
    1. Morrison HG,
    2. Grim SL,
    3. Vineis JH,
    4. Sogin ML
    . 2013. 16S amplicon Illumina sequencing methods. figshare https://doi.org/10.6084/m9.figshare.833944.v1. Retrieved 12 Dec 2017.
  49. 49.↵
    1. Eren AM,
    2. Vineis JH,
    3. Morrison HG,
    4. Sogin ML
    . 2013. A filtering method to generate high quality short reads using Illumina paired-end technology. PLoS One 8:e66643. doi:10.1371/journal.pone.0066643.
    OpenUrlCrossRefPubMed
  50. 50.↵
    1. Martin M
    . 2011. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 17:10–12. doi:10.14806/ej.17.1.200.
    OpenUrlCrossRefPubMed
  51. 51.↵
    1. Zhang J,
    2. Kobert K,
    3. Flouri T,
    4. Stamatakis A
    . 2014. PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics 30:614–620. doi:10.1093/bioinformatics/btt593.
    OpenUrlCrossRefPubMedWeb of Science
  52. 52.↵
    1. Schloss PD,
    2. Westcott SL,
    3. Ryabin T,
    4. Hall JR,
    5. Hartmann M,
    6. Hollister EB,
    7. Lesniewski RA,
    8. Oakley BB,
    9. Parks DH,
    10. Robinson CJ,
    11. Sahl JW,
    12. Stres B,
    13. Thallinger GG,
    14. Van Horn DJ,
    15. Weber CF
    . 2009. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75:7537–7541. doi:10.1128/AEM.01541-09.
    OpenUrlAbstract/FREE Full Text
  53. 53.↵
    1. Huse SM,
    2. Dethlefsen L,
    3. Huber JA,
    4. Mark Welch D,
    5. Welch DM,
    6. Relman DA,
    7. Sogin ML
    . 2008. Exploring microbial diversity and taxonomy using SSU rRNA hypervariable tag sequencing. PLoS Genet 4:e1000255. doi:10.1371/journal.pgen.1000255.
    OpenUrlCrossRefPubMed
  54. 54.↵
    1. Quast C,
    2. Pruesse E,
    3. Yilmaz P,
    4. Gerken J,
    5. Schweer T,
    6. Yarza P,
    7. Peplies J,
    8. Glöckner FO
    . 2013. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res 41:590–596. doi:10.1093/nar/gks1219.
    OpenUrlCrossRef
  55. 55.↵
    1. Eren AM,
    2. Maignien L,
    3. Sul WJ,
    4. Murphy LG,
    5. Grim SL,
    6. Morrison HG,
    7. Sogin ML
    . 2013. Oligotyping: differentiating between closely related microbial taxa using 16S rRNA gene data. Methods Ecol Evol 4:1111–1119. doi:10.1111/2041-210X.12114.
    OpenUrlCrossRef
  56. 56.↵
    1. Eren AM,
    2. Morrison HG,
    3. Lescault PJ,
    4. Reveillaud J,
    5. Vineis JH,
    6. Sogin ML
    . 2015. Minimum entropy decomposition: unsupervised oligotyping for sensitive partitioning of high-throughput marker gene sequences. ISME J 9:968–979. doi:10.1038/ismej.2014.195.
    OpenUrlCrossRefPubMed
  57. 57.↵
    1. Oksanen J,
    2. Blanchet FG,
    3. Friendly M,
    4. Kindt R,
    5. Legendre P,
    6. McGlinn D,
    7. Minchin RP,
    8. O’Hara RB,
    9. Simpson GL,
    10. Solymos P,
    11. Stevens MH,
    12. Szoecs E,
    13. Wagner H
    . 2017. vegan: community ecology package version 2.5-6.
  58. 58.↵
    1. Liaw A,
    2. Wiener M
    . 2002. Classification and regression by randomForest. R News 2/3:18–22.
    OpenUrl
  59. 59.↵
    1. Zeileis A,
    2. Leisch F,
    3. Homik K,
    4. Kleiber C
    . 2002. strucchange: an R package for testing for structural change in linear regression models. J Stat Softw 7:1–38.
    OpenUrlCrossRef
  60. 60.↵
    R Core Team. 2019. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
  61. 61.↵
    1. Kumar S,
    2. Stecher G,
    3. Tamura K
    . 2016. MEGA7: Molecular Evolutionary Genetics Analysis version 7.0 for bigger datasets. Mol Biol Evol 33:1870–1874. doi:10.1093/molbev/msw054.
    OpenUrlCrossRefPubMed
  62. 62.↵
    1. Letunic I,
    2. Bork P
    . 2016. Interactive Tree of Life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic Acids Res 44:W242–W245. doi:10.1093/nar/gkw290.
    OpenUrlCrossRefPubMed
  63. 63.↵
    1. Hu A,
    2. Li S,
    3. Zhang L,
    4. Wang H,
    5. Yang J,
    6. Luo Z,
    7. Rashid A,
    8. Chen S,
    9. Huang W,
    10. Yu C-P
    . 2018. Prokaryotic footprints in urban water ecosystems: a case study of urban landscape ponds in a coastal city, China. Environ Pollut 242:1729–1739. doi:10.1016/j.envpol.2018.07.097.
    OpenUrlCrossRef
PreviousNext
Back to top
Download PDF
Citation Tools
FORENSIC: an Online Platform for Fecal Source Identification
Adélaïde Roguet, Özcan C. Esen, A. Murat Eren, Ryan J. Newton, Sandra L. McLellan
mSystems Mar 2020, 5 (2) e00869-19; DOI: 10.1128/mSystems.00869-19

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero
Print
Alerts
Sign In to Email Alerts with your Email Address
Email

Thank you for sharing this mSystems article.

NOTE: We request your email address only to inform the recipient that it was you who recommended this article, and that it is not junk mail. We do not retain these email addresses.

Enter multiple addresses on separate lines or separate them with commas.
FORENSIC: an Online Platform for Fecal Source Identification
(Your Name) has forwarded a page to you from mSystems
(Your Name) thought you would be interested in this article in mSystems.
CAPTCHA
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.
Share
FORENSIC: an Online Platform for Fecal Source Identification
Adélaïde Roguet, Özcan C. Esen, A. Murat Eren, Ryan J. Newton, Sandra L. McLellan
mSystems Mar 2020, 5 (2) e00869-19; DOI: 10.1128/mSystems.00869-19
del.icio.us logo Digg logo Reddit logo Twitter logo CiteULike logo Facebook logo Google logo Mendeley logo
  • Top
  • Article
    • ABSTRACT
    • INTRODUCTION
    • RESULTS
    • DISCUSSION
    • MATERIALS AND METHODS
    • ACKNOWLEDGMENTS
    • FOOTNOTES
    • REFERENCES
  • Figures & Data
  • Info & Metrics
  • PDF

KEYWORDS

microbial source tracking
16S rRNA gene
high-throughput sequencing
Bacteroidales
Clostridiales
random forest classification
toolkit

Related Articles

Cited By...

About

  • About mSystems
  • Author Videos
  • Board of Editors
  • Policies
  • Overleaf Pilot
  • For Reviewers
  • For the Media
  • For Librarians
  • For Advertisers
  • Alerts
  • RSS
  • FAQ
  • Permissions
  • Journal Announcements

Authors

  • ASM Author Center
  • Submit a Manuscript
  • Author Warranty
  • Types of Articles
  • Getting Started
  • Ethics
  • Contact Us

Follow #mSystemsJ

@ASMicrobiology

       

 

ASM Journals

ASM journals are the most prominent publications in the field, delivering up-to-date and authoritative coverage of both basic and clinical microbiology.

About ASM | Contact Us | Press Room

 

ASM is a member of

Scientific Society Publisher Alliance

Copyright © 2021 American Society for Microbiology | Privacy Policy | Website feedback

Online ISSN: 2379-5077