Special Issue Perspective | Clinical Science and Epidemiology

Data and Statistical Methods To Analyze the Human Microbiome

Levi Waldron
Graduate School of Public Health and Health Policy, City University of New York, New York, New York, USA
Institute for Implementation Science in Population Health, City University of New York, New York, New York, USA
DOI: 10.1128/mSystems.00194-17

ABSTRACT

The Waldron lab for computational biostatistics bridges the areas of cancer genomics and microbiome studies for public health, developing methods to exploit publicly available data resources and to integrate -omics studies.

mSystems® vol. 3, no. 2, is a special issue sponsored by Janssen Human Microbiome Institute (JHMI).

PERSPECTIVE

The rapidly developing field of human microbiome studies will benefit from adapting the statistical and computational methods of more mature areas of high-dimensional data analysis and from ongoing use of the growing catalog of publicly available microbiome data. This perspective discusses methods and resources for robust identification of differentially abundant microbes and predictive models of microbiome-linked health outcomes. I summarize lessons from high-dimensional data analysis for cancer genomics and efforts by my lab to leverage and adapt the Bioconductor project for analysis and comprehension of high-throughput genomic data (1) to bring value-added published data, meta-analysis, and methods for multiomic data analysis to the microbiome community.

COMPARATIVE ANALYSIS AND META-ANALYSIS FOR DIFFERENTIAL ABUNDANCE

Differential abundance analysis is probably the most common objective of microbiome profiling studies and of genomics studies in general. The objective is to identify microbial taxa, anywhere on the tree of life, that are over- or underabundant in some condition relative to a reference condition. These conditions can be observed or experimentally determined. The most commonly used methods for differential abundance analysis are LEfSe (2) and a variety of tools based on log-linear regression models with negative binomial (3) or zero-inflated Gaussian (4) error models. Regression approaches include false-discovery rate estimation to correct for multiple-hypothesis testing. Log-linear modeling approaches build on a large body of statistical and computational work and provide several practical advantages. First, regression approaches eliminate the need for rarefaction, a process that has been described as “inadmissible” for the identification of differentially abundant taxa (5) because it throws away potentially useful data, namely, the extra reads from samples with greater sequencing depth. Second, they adapt empirical Bayesian methods developed to reduce false-positive results in microarray differential expression analysis by “borrowing” information across taxa on how taxa are distributed across samples. Finally, they accommodate multivariate models that can be used for causal inference, for example, to control for confounding effects or to test hypotheses of the microbiome as a mediator between environmental exposure and health outcomes. Regression modeling, now the almost exclusive choice for differential expression analysis of RNA sequencing data, is also well suited to metatranscriptomic differential abundance analysis.
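The false-discovery rate step mentioned above can be illustrated with a minimal sketch. It assumes the Benjamini-Hochberg procedure, which is one common choice and is not named in the text. The taxon names and P values are hypothetical, and the code is not taken from any of the cited tools.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-taxon p-values, e.g. from one regression model per taxon.
p_by_taxon = {
    "taxon_A": 0.001,
    "taxon_B": 0.008,
    "taxon_C": 0.039,
    "taxon_D": 0.041,
    "taxon_E": 0.60,
}
taxa = list(p_by_taxon)
q_values = benjamini_hochberg([p_by_taxon[t] for t in taxa])

# Taxa called differentially abundant at a 5% false-discovery rate.
significant = [t for t, q in zip(taxa, q_values) if q <= 0.05]
```

In this toy input, taxon_C and taxon_D have raw P values below 0.05 but adjusted values slightly above it, which is the kind of call that multiple-testing correction is meant to change.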

These efforts can be enhanced by the standardization and reuse of published data for meta-analysis, comparative analysis, and method development. Thus, my lab developed the curatedMetagenomicData database (6) in collaboration with the laboratories of Nicola Segata (MetaPhlAn2 [7] and other methods for metagenomics), Curtis Huttenhower (developers of the bioBakery [8] and many methods therein), and Martin Morgan (head of the Bioconductor project [1]). This database provides more than 6,000 human-associated shotgun metagenomic profiles, uniformly processed from raw sequencing data to provide taxonomic abundance (7) and metabolic functional potential (9). Samples are primarily from stool specimens but include the Human Microbiome Project and other data sets sampling from other human body sites. We developed a fully automated, cloud-based pipeline to facilitate ongoing addition and updating of the database as new metagenomes and reference genomes become available and to encourage community contributions and even creation of alternative and competing databases.

MULTIOMIC INVESTIGATION OF THE MICROBIOME

Metagenomic studies, as in other areas of genomics, increasingly incorporate multiple assays in an experiment. My lab recently published MultiAssayExperiment (10), software for the integration of multiomics experiments in Bioconductor. MultiAssayExperiment has enabled coordinated representation and manipulation of multiple -omics data types for 11,000 patients and 33 cancers studied as part of the Cancer Genome Atlas. A more complete picture of host-microbiome relationships may also be developed by collecting multiple -omics data types, and I have been involved in studies including metatranscriptomics (11) and host gene expression (12) in addition to taxonomic and functional microbiome abundance data. To overcome the complexity of reproducible data analysis and interpretation of such experiments, I am working with other Bioconductor microbiome package developers to create a common standard for representing microbiome data. This standard will provide compatibility with MultiAssayExperiment and with recent advances based on HDF5 and Google BigTable for on-disk data and remote representation of very large data. This will, for example, allow curatedMetagenomicData (6) to represent taxonomic, gene family, and metabolic functional profiles for more than 6,000 samples as a single Bioconductor object that users can interact with in almost the same way as they currently do with microbiome (4, 13) or gene expression data from a single study, even on a standard laptop.
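The idea of coordinated representation of several assays can be sketched in a few lines. This is a toy container written for illustration only. It is not the MultiAssayExperiment API, and the class name, method names, assay names, and identifiers are all hypothetical. It keeps one table per assay plus a map that links each assay column to a participant, so that subsetting by participant stays consistent across assays.

```python
class MultiAssaySketch:
    """Toy container: several assays plus a map from assay columns to participants."""

    def __init__(self, assays, sample_map):
        # assays: {assay_name: {column_name: {feature_name: value}}}
        # sample_map: list of (assay_name, participant_id, column_name) rows
        self.assays = assays
        self.sample_map = list(sample_map)

    def participants(self, assay_name):
        """Participants with at least one column in the given assay."""
        return {p for a, p, _ in self.sample_map if a == assay_name}

    def complete_cases(self):
        """Participants measured in every assay."""
        sets = [self.participants(name) for name in self.assays]
        return set.intersection(*sets) if sets else set()

    def subset(self, participant_ids):
        """Return a new container restricted to the given participants."""
        keep = set(participant_ids)
        new_map = [row for row in self.sample_map if row[1] in keep]
        new_assays = {}
        for name, columns in self.assays.items():
            wanted = {c for a, _, c in new_map if a == name}
            new_assays[name] = {c: v for c, v in columns.items() if c in wanted}
        return MultiAssaySketch(new_assays, new_map)


# Hypothetical data: three participants profiled by metagenomics,
# two of them also profiled by metatranscriptomics.
mae = MultiAssaySketch(
    assays={
        "taxonomy": {
            "s1_mgx": {"Bacteroides": 0.40},
            "s2_mgx": {"Bacteroides": 0.10},
            "s3_mgx": {"Bacteroides": 0.25},
        },
        "metatranscriptome": {
            "s1_mtx": {"geneX": 12.0},
            "s3_mtx": {"geneX": 3.0},
        },
    },
    sample_map=[
        ("taxonomy", "P1", "s1_mgx"),
        ("taxonomy", "P2", "s2_mgx"),
        ("taxonomy", "P3", "s3_mgx"),
        ("metatranscriptome", "P1", "s1_mtx"),
        ("metatranscriptome", "P3", "s3_mtx"),
    ],
)

# Keep only participants who have both data types.
paired = mae.subset(mae.complete_cases())
```

Keeping the participant-to-column map separate from the assay tables is the design choice that lets each assay have its own column names and its own set of measured participants.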

PREDICTIVE MODELING/MACHINE LEARNING

Prediction of health outcomes is a complementary objective to differential abundance analysis. Although similar models are sometimes used for these different objectives, the objective of making accurate predictions motivates different methods for model development and assessment. A mainstream approach to prediction modeling in high-dimensional data is to apply multivariate penalized regression, or machine learning methods such as support vector machines, in conjunction with cross-validation to assess prediction accuracy. These approaches have been quickly adopted for prediction of health status from microbiome data. Colleagues and I have previously shown in meta-analyses of cancer transcriptomes that such approaches are prone to overoptimistic estimation of prediction accuracy (14). There are numerous possible reasons for such overoptimism. The data used to develop prediction models are by necessity retrospective, meaning they are predicting the past and not the future. Other sources include “information leakage” within a data set through incorrect cross-validation, “reverse causality” effects of treatment on the microbiome, and batch effects introduced by knowledge of outcomes, for example, by sequencing cases together and then sequencing controls in another batch. In addition, most studies do not collect statistically random samples, and therefore, the samples are not representative of the population.
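One common form of information leakage is selecting features on the full data set before cross-validating. The sketch below shows a cross-validation loop that avoids it by repeating feature selection inside every fold, using training samples only. The mean-difference feature filter, the nearest-centroid classifier, and the toy data are illustrative assumptions and are not methods from the cited studies.

```python
def mean(values):
    return sum(values) / len(values)


def select_top_features(X_train, y_train, n_keep=1):
    """Rank features by absolute difference in class means (labels 0 and 1)."""
    scores = []
    for j in range(len(X_train[0])):
        cases = [row[j] for row, label in zip(X_train, y_train) if label == 1]
        controls = [row[j] for row, label in zip(X_train, y_train) if label == 0]
        scores.append(abs(mean(cases) - mean(controls)))
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:n_keep]


def fit_centroids(X_train, y_train):
    """Nearest-centroid model: one mean vector per class label."""
    centroids = {}
    for label in set(y_train):
        rows = [row for row, l in zip(X_train, y_train) if l == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict_nearest(centroids, row):
    """Predict the label whose centroid is closest in squared distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def cross_validate(X, y, k, select, fit, predict):
    """k-fold accuracy with feature selection repeated inside every fold."""
    n = len(y)
    folds = [list(range(i, n, k)) for i in range(k)]
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        # Selection sees training samples only, so held-out labels cannot leak.
        features = select(X_train, y_train)
        model = fit([[row[j] for j in features] for row in X_train], y_train)
        for i in test_idx:
            correct += predict(model, [X[i][j] for j in features]) == y[i]
    return correct / n


# Toy data: feature 0 separates the classes, features 1 and 2 are noise.
X = [
    [1.0, 0.3, 0.7], [5.0, 0.4, 0.6], [1.2, 0.5, 0.2], [4.8, 0.1, 0.9],
    [0.9, 0.8, 0.4], [5.1, 0.6, 0.5], [1.1, 0.2, 0.3], [4.9, 0.7, 0.8],
]
y = [0, 1, 0, 1, 0, 1, 0, 1]
accuracy = cross_validate(X, y, 4, select_top_features, fit_centroids, predict_nearest)
```

The leaky variant would call the selection function once on all of X and y before the loop. With many noise features and few samples, that variant can report high accuracy even when no feature is truly informative.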

Even with these challenges, it is sometimes still possible to develop accurate models of disease state and outcome from high-dimensional data. Colleagues and I showed that systematic leave-one-data set-in cross-study validation (15) of independent publicly available data sets provides a more realistic picture of generalizable prediction accuracy and that heterogeneous studies can be used to train robust prediction models through leave-one-data set-out cross-study validation (16). We have also shown the value of these approaches for metagenomic prediction problems (17). In systematic cross-study validation of gene expression-based models of cancer patient prognosis, we have shown even simple and suboptimal machine learning algorithms to be competitive with complex, theoretically optimal methods (18). Standardized databases like curatedMetagenomicData (6) and our in-development HMP16SData package (http://bioconductor.org/packages/HMP16SData/) will facilitate future work to find the limits of accuracy for disease prediction from all available microbiome profiles.
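The two cross-study validation designs mentioned above differ only in how samples are split by study. The sketch below generates both kinds of split from a list of per-sample study labels. It rests on one reading of the terms: "leave-one-data set-out" trains on all studies but one and validates on the held-out study, and "leave-one-data set-in" trains on a single study and validates on each of the others. The function names and study labels are hypothetical, and any model fitting is left to the caller.

```python
def leave_one_dataset_out(study_labels):
    """Yield (held_out_study, train_indices, test_indices) for each study."""
    for held_out in sorted(set(study_labels)):
        train = [i for i, s in enumerate(study_labels) if s != held_out]
        test = [i for i, s in enumerate(study_labels) if s == held_out]
        yield held_out, train, test


def leave_one_dataset_in(study_labels):
    """Yield (train_study, test_study, train_indices, test_indices)
    for every ordered pair of distinct studies."""
    studies = sorted(set(study_labels))
    for train_study in studies:
        train = [i for i, s in enumerate(study_labels) if s == train_study]
        for test_study in studies:
            if test_study == train_study:
                continue
            test = [i for i, s in enumerate(study_labels) if s == test_study]
            yield train_study, test_study, train, test


# Hypothetical study membership for six samples drawn from three studies.
study_labels = ["A", "A", "B", "B", "B", "C"]

lodo_splits = list(leave_one_dataset_out(study_labels))
lodi_splits = list(leave_one_dataset_in(study_labels))
```

Because whole studies are held out, no sample from the validation study can influence training, which is what distinguishes these designs from ordinary within-study cross-validation.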

FUTURE OUTLOOK

Discoveries that are replicable across independent experiments are more likely to be valid and useful than those seen only in a single data set. My research aims to harness publicly available microbiome data through curation, integration and standardization, novel reanalysis, and methodological development. I aim to ensure that studies of the human microbiome benefit from concurrent methodological development in other areas of genomics and from the growing body of publicly available microbiome data. These benefits include more reliable identification of differentially abundant microbial species, strains, and community structure and the development of disease prediction models that hold up to independent validation across populations. I see the Bioconductor project as providing a unique opportunity for the microbiome community to leverage more than 15 years of development of statistical methods for -omics data and to integrate microbiome data with other types of high-throughput data. As such, I plan to continue adapting the Bioconductor platform to the needs of the microbiome community, through the development of databases, promotion of standards for data representation, and development of needed methods for data manipulation and analysis.

ACKNOWLEDGMENTS

The work discussed in this perspective was funded by the National Cancer Institute (U24CA180996) and by the National Institute of Allergy and Infectious Diseases (1R21AI121784-01) of the National Institutes of Health.

FOOTNOTES

    • Received November 18, 2017.
    • Accepted December 7, 2017.
  • Copyright © 2018 Waldron.

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license.

REFERENCES

  1. Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, Bravo HC, Davis S, Gatto L, Girke T, Gottardo R, Hahne F, Hansen KD, Irizarry RA, Lawrence M, Love MI, MacDonald J, Obenchain V, Oleś AK, Pagès H, Reyes A, Shannon P, Smyth GK, Tenenbaum D, Waldron L, Morgan M. 2015. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods 12:115–121. doi:10.1038/nmeth.3252.
  2. Segata N, Izard J, Waldron L, Gevers D, Miropolsky L, Garrett WS, Huttenhower C. 2011. Metagenomic biomarker discovery and explanation. Genome Biol 12:R60. doi:10.1186/gb-2011-12-6-r60.
  3. Love MI, Huber W, Anders S. 2014. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15:550. doi:10.1186/s13059-014-0550-8.
  4. Paulson JN, Stine OC, Bravo HC, Pop M. 2013. Differential abundance analysis for microbial marker-gene surveys. Nat Methods 10:1200–1202. doi:10.1038/nmeth.2658.
  5. McMurdie PJ, Holmes S. 2014. Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Comput Biol 10:e1003531. doi:10.1371/journal.pcbi.1003531.
  6. Pasolli E, Schiffer L, Manghi P, Renson A, Obenchain V, Truong DT, Beghini F, Malik F, Ramos M, Dowd JB, Huttenhower C, Morgan M, Segata N, Waldron L. 2017. Accessible, curated metagenomic data through ExperimentHub. Nat Methods 14:1023–1024. doi:10.1038/nmeth.4468.
  7. Truong DT, Franzosa EA, Tickle TL, Scholz M, Weingart G, Pasolli E, Tett A, Huttenhower C, Segata N. 2015. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat Methods 12:902–903. doi:10.1038/nmeth.3589.
  8. McIver LJ, Abu-Ali G, Franzosa EA, Schwager R, Morgan XC, Waldron L, Segata N, Huttenhower C. 29 November 2017. bioBakery: a meta’omic analysis environment. Bioinformatics doi:10.1093/bioinformatics/btx754.
  9. Abubucker S, Segata N, Goll J, Schubert AM, Izard J, Cantarel BL, Rodriguez-Mueller B, Zucker J, Thiagarajan M, Henrissat B, White O, Kelley ST, Methé B, Schloss PD, Gevers D, Mitreva M, Huttenhower C. 2012. Metabolic reconstruction for metagenomic data and its application to the human microbiome. PLoS Comput Biol 8:e1002358. doi:10.1371/journal.pcbi.1002358.
  10. Ramos M, Schiffer L, Re A, Azhar R, Basunia A, Rodriguez C, Chan T, Chapman P, Davis SR, Gomez-Cabrero D, Culhane AC, Haibe-Kains B, Hansen KD, Kodali H, Louis MS, Mer AS, Riester M, Morgan M, Carey V, Waldron L. 2017. Software for the integration of multiomics experiments in Bioconductor. Cancer Res 77:e39–e42. doi:10.1158/0008-5472.CAN-17-0344.
  11. Franzosa EA, Morgan XC, Segata N, Waldron L, Reyes J, Earl AM, Giannoukos G, Boylan MR, Ciulla D, Gevers D, Izard J, Garrett WS, Chan AT, Huttenhower C. 2014. Relating the metatranscriptome and metagenome of the human gut. Proc Natl Acad Sci U S A 111:E2329–E2338. doi:10.1073/pnas.1319284111.
  12. Morgan XC, Kabakchiev B, Waldron L, Tyler AD, Tickle TL, Milgrom R, Stempak JM, Gevers D, Xavier RJ, Silverberg MS, Huttenhower C. 2015. Associations between host gene expression, the mucosal microbiome, and clinical outcome in the pelvic pouch of patients with inflammatory bowel disease. Genome Biol 16:67. doi:10.1186/s13059-015-0637-x.
  13. McMurdie PJ, Holmes S. 2013. phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS One 8:e61217. doi:10.1371/journal.pone.0061217.
  14. Waldron L, Haibe-Kains B, Culhane AC, Riester M, Ding J, Wang XV, Ahmadifar M, Tyekucheva S, Bernau C, Risch T, Ganzfried BF, Huttenhower C, Birrer M, Parmigiani G. 2014. Comparative meta-analysis of prognostic gene signatures for late-stage ovarian cancer. J Natl Cancer Inst 106:dju049. doi:10.1093/jnci/dju049.
  15. Bernau C, Riester M, Boulesteix A-L, Parmigiani G, Huttenhower C, Waldron L, Trippa L. 2014. Cross-study validation for the assessment of prediction algorithms. Bioinformatics 30:i105–i112. doi:10.1093/bioinformatics/btu279.
  16. Riester M, Wei W, Waldron L, Culhane AC, Trippa L, Oliva E, Kim S-H, Michor F, Huttenhower C, Parmigiani G, Birrer MJ. 2014. Risk prediction for late-stage ovarian cancer by meta-analysis of 1525 patient samples. J Natl Cancer Inst 106:dju048. doi:10.1093/jnci/dju048.
  17. Pasolli E, Truong DT, Malik F, Waldron L, Segata N. 2016. Machine learning meta-analysis of large metagenomic datasets: tools and biological insights. PLoS Comput Biol 12:e1004977. doi:10.1371/journal.pcbi.1004977.
  18. Zhao SD, Parmigiani G, Huttenhower C, Waldron L. 2014. Más-o-menos: a simple sign averaging method for discrimination in genomic data analysis. Bioinformatics 30:3062–3069. doi:10.1093/bioinformatics/btu488.
mSystems Mar 2018, 3 (2) e00194-17; DOI: 10.1128/mSystems.00194-17

KEYWORDS

machine learning
meta-analysis
metagenomics
statistical analysis


Online ISSN: 2379-5077