Avoiding Pandemic Fears in the Subway and Conquering the Platypus

Metagenomics is increasingly used not just to show patterns of microbial diversity but also as a culture-independent method to detect individual organisms of intense clinical, epidemiological, conservation, forensic, or regulatory interest. A widely reported metagenomic study of the New York subway suggested that the pathogens Yersinia pestis and Bacillus anthracis were part of the “normal subway microbiome.”

The development and validation of novel methods that use next-generation DNA sequence data to detect pathogens from complex ecosystems represent important areas of research. In particular, these methods are important in studies of the built environment and of agricultural systems, where the correct detection of pathogens represents enormous public benefit and where incorrect detection creates fear. For example, in a recent study of the New York subway (1), due to incorrect taxonomic classifications, the authors reported observing Yersinia pestis (the causative agent of plague) and Bacillus anthracis (the causative agent of anthrax) as part of the "normal subway microbiome." These observations led to high-visibility news reports. But improved reanalysis of the same data by Hsu et al. (2) demonstrated that these results were illusory. Hsu et al. found that these pathogens were not part of the normal subway microbiome, either in New York or in an independent sample set from the Boston subway. They drew the more plausible conclusion that the surfaces were dominated by inputs of normal human skin bacteria, consistent with other studies, and found that the subway was not a reservoir of bacterially encoded toxins or antimicrobial resistance elements. That carefully conducted study added fundamentally to our knowledge of the transmission and expression of microbes in high-traffic built environments.
Another example of the importance of accurate pathogen identification from next-generation sequencing data is the ability to detect Salmonella from fresh produce. In a study by Ottesen et al. (3), the authors could not confirm the presence of Salmonella on the tomato crops through the use of 16S amplicon sequencing. However, an analysis of shotgun data from samples collected from the roots, leaves, and fruits of the tomato plants performed using the MG-RAST server reported hits corresponding to Salmonella. Furthermore, this analysis also showed the surprising presence of Gallus gallus (red jungle fowl), Mus musculus (house mouse), and even the elusive Ornithorhynchus anatinus (duck-billed platypus).

Detecting the presence of specific taxa from MG-RAST public datasets.
To illustrate how pervasive false positives are in MG-RAST, we downloaded all public samples (25,943 samples; accessed 22 April 2015), searched each report for Salmonella, Raphus (dodo bird), Thylacinus (Tasmanian tiger), and Ornithorhynchus (duck-billed platypus), and summarized the findings by the countries in which these organisms were observed, using the latitude and longitude fields in the associated metadata (Table 1). A Jupyter (8) Notebook reproducing this report can be found at http://goo.gl/UIhBjf.
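
The tallying step of such a survey can be sketched in a few lines of Python. This is only an illustrative sketch, not the code from the notebook: the record layout (the `taxa` and `country` keys) and the function name are hypothetical, and the country is assumed to have been resolved from the latitude and longitude fields beforehand.

```python
from collections import Counter

# Genera searched for in the text above.
TARGET_GENERA = ("Salmonella", "Raphus", "Thylacinus", "Ornithorhynchus")


def count_samples_by_country(samples, genera=TARGET_GENERA):
    """Count samples per (genus, country) in which the genus is reported.

    `samples` is an iterable of dicts with two hypothetical keys:
      "taxa"    - taxon names listed in the sample's report
      "country" - country already resolved from latitude/longitude
    """
    counts = Counter()
    for sample in samples:
        # Reduce each taxon name to its genus (first word); the set ensures
        # a sample is counted at most once per genus.
        reported = {name.split()[0] for name in sample["taxa"] if name.strip()}
        for genus in genera:
            if genus in reported:
                counts[(genus, sample["country"])] += 1
    return counts
```

Counting samples rather than hits keeps a single heavily sequenced sample from dominating the summary, which seems closer to the per-country view described above.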

Conquering the platypus.
To demonstrate how the problem of confirming the presence of specific taxa in metagenomic samples can be addressed, we created Platypus Conquistador (https://github.com/biocore/Platypus-Conquistador), a BSD-licensed Python package based on BLAST (4) and SortMeRNA (5). Platypus Conquistador confirms the presence or absence of a taxon of interest within shotgun metagenomic datasets by relying on two reference sequence databases: an inclusion database, which includes the sequences of interest (e.g., Salmonella), and an exclusion database, which includes any known sequence background (e.g., platypus). The reference sequence databases are expected to be mutually exclusive. In general, these two databases can be created by partitioning an existing database, such as the gene data provided by the Integrated Microbial Genomes (IMG) (6) system. These partitions can be customized to include taxa of specific interest. This method has been used by Ottesen et al. (7) to describe the efficacy of enrichment steps in the effort to culture Salmonella from tomatoes. For that analysis, the authors ran Platypus Conquistador on shotgun metagenomic data using the IMG database split into an inclusion database, including only those sequences assigned to Salmonella, and an exclusion database containing all remaining sequences, demonstrating the absence of this pathogen.
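
The inclusion/exclusion idea can be illustrated with a minimal Python sketch. It does not reproduce Platypus Conquistador's interface or its decision rules, which are not described here. It assumes the reads have already been searched against both databases and that the hits are available as standard 12-column BLAST tabular text (bit score in the last column). The function names and the rule "assign a read to whichever database gives the strictly higher best bit score" are assumptions made for illustration.

```python
import csv
import io


def best_bitscores(blast_tabular):
    """Map each query read to its best bit score.

    Expects BLAST tabular text with the 12 default columns:
    qseqid sseqid pident length mismatch gapopen qstart qend
    sstart send evalue bitscore
    """
    best = {}
    for row in csv.reader(io.StringIO(blast_tabular), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        read_id, bitscore = row[0], float(row[11])
        if bitscore > best.get(read_id, float("-inf")):
            best[read_id] = bitscore
    return best


def classify_reads(inclusion_hits, exclusion_hits):
    """Label each read 'inclusion', 'exclusion', or 'ambiguous'.

    Assumed rule (illustrative only): a read supports the taxon of
    interest only if its best inclusion hit strictly beats its best
    exclusion hit; ties are reported as ambiguous.
    """
    inc = best_bitscores(inclusion_hits)
    exc = best_bitscores(exclusion_hits)
    calls = {}
    for read_id in inc.keys() | exc.keys():
        i, e = inc.get(read_id), exc.get(read_id)
        if i is not None and (e is None or i > e):
            calls[read_id] = "inclusion"
        elif e is not None and (i is None or e > i):
            calls[read_id] = "exclusion"
        else:
            calls[read_id] = "ambiguous"
    return calls
```

Under this rule, a read that matches Salmonella but matches a background sequence at least as well is not counted as evidence for the pathogen, which is the kind of check that a single-database search cannot perform.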

Conclusions. Simple bioinformatics solutions exist to detect taxa of interest and to resolve incorrect taxonomic classifications for shotgun sequencing data. Incorrect but pervasive taxonomic classifications can lead to conclusions that lack prima facie validity (for example, the environments in which the platypus was reportedly found range from the built environment to the human gut). Worse, these incorrect assignments have great potential to spark unwarranted public concern, as was seen in the case of the NYC subway microbiome paper noted above.
These examples should also serve as a reminder that, although analytical software pipelines and computational methods can be thoroughly tested and validated, their results are based on user-specified parameters that change the results and, as a consequence, their validity. Researchers must always question the rationality of the parameters and meaning of the results to reduce the possibility of incorrect conclusions. Moving toward standardized and reproducible pipelines of analysis that can be scrutinized by our peers will greatly help avoid similar problems in the future. For pathogen detection, it is critical to additionally define taxon inclusion and exclusion criteria based on the studied environment in order to discard misleading results. This is especially important in cases of intense public interest, such as exposure in systems used by millions of people every day to apparent pathogens that are as illusory as the benthic Platypus.