ABSTRACT
Since the turn of the century, technological advances have made it possible to obtain the molecular profile of any tissue in a cost-effective manner. Among these advances are sophisticated high-throughput assays that measure the relative abundances of microorganisms, RNA molecules, and metabolites. While these data are most often collected to gain new insights into biological systems, they can also be used as biomarkers to create clinically useful diagnostic classifiers. How best to classify high-dimensional -omics data remains an area of active research. However, few explicitly model the relative nature of these data and instead rely on cumbersome normalizations. This report (i) emphasizes the relative nature of health biomarkers, (ii) discusses the literature surrounding the classification of relative data, and (iii) benchmarks how different transformations perform for regularized logistic regression across multiple biomarker types. We show how an interpretable set of log contrasts, called balances, can prepare data for classification. We propose a simple procedure, called discriminative balance analysis, to select groups of 2 and 3 bacteria that can together discriminate between experimental conditions. Discriminative balance analysis is a fast, accurate, and interpretable alternative to data normalization.
IMPORTANCE High-throughput sequencing provides an easy and cost-effective way to measure the relative abundance of bacteria in any environmental or biological sample. When these samples come from humans, the microbiome signatures can act as biomarkers for disease prediction. However, because bacterial abundance is measured as a composition, the data have unique properties that make conventional analyses inappropriate. To overcome this, analysts often use cumbersome normalizations. This article proposes an alternative method that identifies pairs and trios of bacteria whose stoichiometric presence can differentiate between diseased and nondiseased samples. By using interpretable log contrasts called balances, we developed an entirely normalization-free classification procedure that reduces the feature space and improves the interpretability, without sacrificing classifier performance.
INTRODUCTION
Many of the newest assays used in molecular research produce data that are relative in nature. This includes high-throughput sequencing (HTS), as used to quantify the presence of bacterial or gene species from environmental and biological samples. This also includes hyphenated chromatographic assays like liquid chromatography-mass spectrometry (LC-MS), as used to quantify the presence of proteins, lipids, or metabolites. HTS and LC-MS both generate high-dimensional data that can be used as health biomarkers to predict and surveil disease (1). They also both measure abundance by sampling from the total population. Consequently, the total number of molecules recorded for each sample is arbitrary, making these data compositional (2–8). Others have already demonstrated that compositionality confounds the routine application of univariate (9), correlation (10), and distance (11) measures. Since machine learning pipelines often rely on these measures, compositionality may impact the accuracy of classifiers trained on these data (2, 12).
Compositional data analyses tend to have one of three flavors depending on the transformation used. Although these transformations have technical differences, the choice between them will often depend on the desired interpretation. First, the “simple” log ratio approach uses a single reference to recast the data. Most commonly, the reference is the per-sample geometric mean (centered log ratio [CLR] transformation) or a single component (additive log ratio [ALR] transformation), but the geometric mean of interquartile range components (13) and of nonzero components (14) have also been proposed. After transformation, the analysis then proceeds as if the data were absolute, but with a caveat: the interpretation of the results depends on the reference used. Second, the “pragmatic approach” analyzes pairwise log ratios directly; this type of analysis has been used to score important genes (15) and gene pairs (16, 17), and to reduce the dimensionality of the data (17). This approach makes sense when the ratios themselves have some importance to the analyst. However, it presents a clear problem for the classification of high-dimensional data: ratios “explode” feature space from D features to
Several studies have applied supervised statistical learning to compositional data. Aitchison trained linear discriminant analysis (LDA) models on ALR-transformed data (28), as have others (29) (though LDA is now usually applied to ILR-transformed data [29, 30]). Generalized linear models, including logistic regression (LR), have also been used to classify compositional data (30, 31). However, both LDA and LR require at least as many samples as features, making them inappropriate for high-dimensional health biomarker data (though this limitation is mitigated by regularization, as used previously [32, 33] to classify compositions). Partial least squares (PLS), also suitable for high-dimensional data, has been applied to CLR-transformed data to predict continuous outcomes (34), while PLS discriminant analysis (PLS-DA) has been used to classify both CLR-transformed (35) and ILR-transformed (36) data. In microbiome research, a stepwise algorithm, implemented as selbal, was proposed to identify a single balance that performs well in classification and regression tasks (37). The last work highlights an advantage of balances: although ALR, CLR, and ILR transformations can facilitate statistical learning, balances can engineer the feature space into interpretable biomarker scores via balance selection. These biomarker scores are not unlike the Firmicutes-to-Bacteroidetes ratio previously found to be associated with obesity (38). In fact, one could think of balance selection as a way of finding important bacteria ratios in a more rigorous and general manner.
How best to classify high-dimensional compositional data remains an open question. We are not aware of any work that benchmarks compositional data transformations as they pertain to the classification of high-dimensional compositional data. In this study, we employed a statistically robust cross-validation scheme to evaluate how well regularized LR classifies health-related binary outcomes on 13 compositional data sets. Specifically, we benchmarked performance using features obtained from raw proportions, CLR-transformed data, balances, and selected balances. We used LR instead of other classifiers because the model weights can be interpreted directly as a measure of feature importance and because regression is a routine part of statistical inference. Our results show that the centered log ratio transformation, and all four balance procedures, outperforms raw proportions for the classification of health biomarker data. We also propose a new balance selection procedure, called discriminatory balance analysis, that offers a computationally efficient way to select important 2- and 3-part balances. These discriminant balances reduce the feature space and improve the interpretability without sacrificing classifier performance. In doing so, they also outperform a recently published balance selection method, selbal, in terms of runtime and classification accuracy.
RESULTS AND DISCUSSION
Choice in log ratio transformation does not impact performance.Figure 1 shows the validation set areas under the receiver operating curves (AUCs) for binary classifiers trained on 13 data sets. In general, it can be seen that the centered log ratio transformation (CLR) and balance procedures (principal balance analysis [PBA], anti-principal balance analysis [ABA], random balance analysis [RBA], and discriminative balance analysis [DBA]) perform comparably. Although they all tend to outperform proportions (ACOMP), the proportions were more discriminative than the CLR for a few tests. This might occur when the closure bias itself confounds the predicted outcome.
The distribution of validation set AUCs (y axis) for classifiers trained on closed or transformed data (x axis). Each validation set AUC describes a unique random training and validation set split. All classifiers are regularized logistic regression models, with λ tuned by training set cross-validation. Abbreviations: ACOMP, closed proportions; CLR, centered log ratio-transformed data; PBA, principal balances; ABA, anti-principal balances; RBA, random balances; DBA, discriminative balances.
Table 1 shows the median of the difference between data transformations (as computed with pairwise Wilcoxon rank sum tests across all 13 tests). Here, it can be seen that every transformation performs better than proportions. Also, all balance procedures tend to perform equally well, though DBA balances perform marginally better. Although selbal posts an impressive accuracy for only using a single balance, it is less accurate than using a set of all balances.
Medians of the differences in performance between data transformation methodsa
DBA method selects predictive balances.An advantage of using regularized logistic regression is that the model weights can be interpreted as a measure of feature importance. Even though the CLR and balances perform equally well, they imply different interpretations. Although the CLR data have one feature per component, the regularized weights do not describe the importance of that component. Rather, the CLR-based model weights describe the importance of that component relative to the sample mean. On the other hand, balances measure the log contrast between sets of components. Thus, the balance-based model weights describe the importance of those components directly.
For high-dimensional data, it can be challenging to interpret large balances. For example, the base of an SBP always contains one balance that comprises all variables. It may not be helpful in understanding the outcome to know that a log contrast involving all components is discriminative. On the other hand, smaller balances (i.e., those involving fewer components) might have a clearer meaning to the analyst. Here, we propose a new procedure, called discriminative balance analysis, to generate an SBP that makes the smallest balances most discriminative. This procedure can be used to engineer and select important balances prior to model building. Since the selected balances contain few parts, they are more easily interpreted.
Conceptualizing the SBP as a tree, the largest balances are the “trunk” and the smallest balances are the “leaves” (Fig. 2). Since the SBP corresponds to an underlying orthonormal basis, we can treat each segment of the tree as its own variable. Figure 3 shows classification AUC using only the “distal leaf” balances (i.e., those with 2 or 3 parts). In principal balance analysis, the trunk contains the most variance, and the leaves contain the least. As expected, the distal PBA balances perform poorly. In anti-principal balance analysis, the trunk contains the least variance, and the leaves contain the most. As expected, the distal ABA balances outperform the distal PBA balances. In random balance analysis, balances are random, so the leaves might be discriminative by chance. As expected, the distal RBA balances have an average performance. In discriminative balance analysis, the trunk is least discriminative, and the leaves are the most. As expected, the distal DBA balances outperform both the PBA and ABA balances. Indeed, since DBA places the most discriminative balances distally, the distal DBA balances perform as well as all DBA balances (see Table 2 for 95% confidence interval).
How a balance dendrogram relates to a serial binary partition (SBP) matrix. The left portion shows a dendrogram clustering the similarity between 6 components, where the first branch in the dendrogram refers to the first balance (i.e., a and e versus c, b, d, and f). The middle portion shows the corresponding SBP with 5 balances (columns) and the components involved in each log contrast (rows). The right portion shows the distal 2- and 3-part balances.
The distribution of validation set AUCs (y axis) for classifiers trained on selected balances (x axis). Each validation set AUC describes a unique random training and validation set split. All classifiers are regularized logistic regression models, with λ tuned by training set cross-validation. The appendix “-distal” indicates that only the 2-part and 3-part balances were used as features.
Medians of the differences in performance between balance selection methodsa
The DBA balances can be interpreted (and visualized) in an intuitive way. The 2-part balances can be visualized as a log ratio, while the 3-part balances can be visualized with a ternary diagram or as a log contrast. In Fig. 4, we compare the most important distal DBA balances (left) with the single discriminative balance found by selbal (right). It can be seen that many of the same variables are represented in both sets. However, DBA expresses the important variables via 2- and 3-part subsets that are, by definition of the SBP, grouped to be maximally discriminative. On the left side, it can be seen that balances with large regularized weights (top left) have log contrast scores that differentiate the groups (bottom left). Though selbal performs remarkably well in its ability to select a single discriminative balance, our results suggest that the distal DBA method outperforms selbal by ∼1 to 4% AUC (Table 2). Moreover, the distal DBA method is an order of magnitude faster than selbal, the latter of which must try multiple component combinations before finding the best log contrast (25 min versus 15 s for 1,000 features).
The most important distal DBA balances (left) compared with the results from selbal (right). In the top left portion are the regularized weights for each distal balance. In the bottom left portion is the distribution of samples for each balance irrespective of weight. The distal DBA classifier uses the weighted sum of these balances to make its prediction. In the right portion is the distribution of a single balance as selected by selbal. Many of the same variables are represented in both sets. DBA selects multiple simple balances instead of one complex balance. All panels generated using the 2a data set, comparing inflammatory bowel disease (in red) with healthy controls (in blue).
We cannot guarantee that these performance trends will hold for nonlinear classifiers like random forests or neural networks. However, a primary advantage of balances is that they allow for a clear interpretation of feature importance that is fully coherent for compositional data. If we do not first log ratio transform these relative data, then the predictive potential of any one feature will depend on all other features. This is because the relative abundances themselves all depend on each other. For example, given the composition [a, b, c], an increase in c will decrease both a and b, but the balance between a and b will not change. The use of nonlinear classifiers alone does not address this fundamental issue.
DBA as a discriminant ordination.By using an orthonormal basis, balances represent the total variance in terms of new variables that allow us to quantify the variance contained in each discriminative balance. We can also break down the contained variance into its between-group and within-group fractions (as done by an analysis of variance [ANOVA]). The left side of Fig. 5 shows that a large fraction of the (log ratio) variance contained in the distal DBA balances is between-group variance. This is because clustering components by
The amount of variance (as a percentage of the total) contained in each distal DBA balance (left), placed alongside a projection of the data across the top 3 most variable distal DBA balances (right). The sum of the between-group variance and the within-group variance equals the total variance. Good class separation is achieved using only 3 balances (each of which is proportional to a simple log ratio). Together, these 3 ratios contain 4.3% of the total variance and 13.8% of the total between-group variance. Both diagrams were generated using the 2a data set, comparing inflammatory bowel disease with healthy controls.
The right side of Fig. 5 shows good class separation using only 3 balances (each of which is actually a simple log ratio). From the left side, we know that these 3 axes contain 4.3% of the total variance and could likewise calculate that they contain 13.8% of the total between-group variance. Meanwhile, all distal DBA balances together account for 90.4% of the total between-group variance. Yet each one of these discriminant axes is fully interpretable, having no more than 3 parts. On the other hand, if the analyst cared less about interpretation and more about maximizing contained between-group variance, they could do a clustering of
A word of clarification about balances is in order. The term balances can be understood more strictly as the coordinates of an orthonormal basis of the sample space. Note that although this basis of the sample space is orthonormal, the balances themselves, when considered as vectors across samples, are not. Thus, discriminant balance variables will usually be correlated with each other.
Summary.This work benchmarks the performance of regularized logistic regression classifiers across 13 high-dimensional health biomarker data sets. Our results show that, on average, the centered log ratio and balances both outperform raw proportions in classification tasks. We also found that the serial binary partition (SBP) matrix used to generate the balances does not impact performance. However, the choice in SBP changes which balances are important for classification. In this report, we introduce a new SBP procedure that makes the most discriminative balances the smallest. This procedure, called discriminative balance analysis, offers a computationally efficient way to select important 2- and 3-part balances. These discriminant balances reduce the feature space and improve the interpretability without sacrificing classifier performance. In doing so, they also outperform a recently published balance selection method, selbal, in terms of runtime and classification accuracy. By using the distal DBA procedure, an analyst can quickly identify a set of highly interpretable bacteria ratios that best summarize the difference between their experimental conditions.
MATERIALS AND METHODS
Data acquisition.We acquired data from 4 principal sources. Two gut microbiome data sets (originally published in references 39 and 40) were acquired from the selbal package (37). Two additional gut microbiome data sets (originally published in references 41 and 42) were acquired from the supplement to the work of Duvallet et al. (MicrobiomeHD database) (43). A fifth gut microbiome data set was acquired from the supplement to the work of Franzosa et al. (44).
The data of Schubert et al. (42) contained 3 classes comparingth hospital-acquired diarrhea (HAD) with community-acquired diarrhea (CAD) and healthy controls (HC). This data set was used in two tests: HAD versus CAD and HAD versus HC. The data of Baxter et al. (41) contained 3 classes comparing colorectal cancer (CRC) with adenoma (AC) and HC. This data set was also used in two tests: CRC versus AC and CRC versus HC. The data of Franzosa et al. (44) contained 3 classes comparing Crohn’s disease (CD) and ulcerative colitis (UC) with HC. This data set was also used in two tests: CD and UC versus HC and CD versus UC. Franzosa et al. also published gut metabolomic data for the same samples. These data were used for an additional two tests that paralleled the gut microbiome tests.
A sixth data set was acquired from The Cancer Genome Atlas (TCGA) (45) and contained microRNA expression for primary breast cancer (BRCA) samples and healthy controls (HC). We further labeled the BRCA samples using PAM50 subtypes retrieved from the supplement to reference 46. PAM50 uses a gene expression signature to assign an intrinsic subtype to the primary breast cancer sample: subtypes include luminal A (LumA), luminal B, HER2-enriched, Basal, and Normal-like. These data were used in three tests: any BRCA versus HC, HER2+ versus all other BRCA, and LumA-BRCA versus LumB-BRCA.
We selected these data because they are all publicly available and because they represent a range of difficult-to-classify data types (16S, metagenomic, metabolomic, and microRNA). All data are available for immediate use in subsequent benchmarks from https://doi.org/10.5281/zenodo.3378099.
Feature extraction and zero handling.Before training any models, features with too few counts were removed from the data. For the metabolomic and microRNA data sets, only features within the top decile of total abundance were included (this was done to reduce the feature space so that selbal became computationally tractable). For all data sets, features that contained zeros in more than 90% of samples were excluded (this was done to remove biomarkers that are not reliably present in the data). Finally, zeros were replaced using a simple multiplicative replacement strategy via the zCompositions package (47) (this was done because the Bayesian replacement strategy fails for heavily zero-laden data). Table 3 summarizes the tests used in this study.
Data used to benchmark data transformation and balance selection methodsa
Data transformation.Let us consider a data matrix with entries xij which describe the relative abundance of
We also use the isometric log ratio (ILR) transformation to construct balances. Roughly speaking, balances are a way of combining the original features into new ones that better respect the geometry of the sample space. The most general way of doing so is in the form of a log-linear combination called a log contrast. A log contrast of a D-part composition
Balances are a way of constructing simple log-contrasts that are relatively easy to interpret (18). This is done using a serial binary partition (SBP) matrix. The SBP matrix describes D – 1 log contrasts between the D parts. These log contrasts are special in that they have
The serial binary partition matrix.We benchmark four procedures for generating an SBP. In PBA, we approximate a set of principal balances by hierarchically clustering the log ratio variance matrix, T, describing the relationship between any two variables j and j* (see reference 24):
Principal balances are analogous to principal components in that the first balance contains the most variance, the second balance the second most variance, and so on. Note that PBA only approximates the principal balances.
In ABA, we hierarchically cluster a new dissimilarity measure defined as the difference of the log ratio variance matrix from the maximum log ratio variance score:
Note that the SBP is always constructed using the training set only. The balance “rule” is then applied to the validation set prior to model deployment. All SBP procedures are implemented in the balance package with the functions sbp.fromPBA, sbp.fromABA, sbp.fromRandom, and sbp.fromPropd (49). Differential proportionality analysis is implemented in the propr package (50) with the function propd. The code snippet below provides a minimally reproducible example for computing distal discriminant balances.
# how to get distal discriminant balances
install.packages(“balance”)
library(balance)
data(iris)
x <- iris[,1:4]
y <- iris[5,]
sbp <- sbp.fromADBA(x, y) # get discriminant balances
sbp <- sbp.subset(sbp) # get distal balances only
z <- balance.fromSBP(
x = x, # the data to recast
y = sbp # the SBP to use
)
Classification pipeline.In order to get a robust measure of performance, we repeat model training on 50 training sets randomly sampled from the data (with 33% set aside as a validation set). For each training set, we (i) transform features as described above, (ii) train a model on the transformed features, (iii) deploy the model on the withheld validation set, and (iv) calculate the area under the receiver operating curve (AUC). AUC is used because it is commonly reported in biological studies. Model splitting, transformation, training, and prediction are all handled by the high-throughput classification software exprso (51). By repeating this procedure 50 times, we can calculate the median performance and its range.
When using selbal, a generalized linear model is trained on a single balance (as described in reference 37). For all other transformations, a least absolute shrinkage and selection operator (LASSO) model is used to select features and fit the data simultaneously (via the glmnet package [52]). When using LASSO, λ is chosen procedurally by measuring 5-fold training set cross-validation accuracy over the series exp(seq(log(0.001), log(5), length.out = 100)) (i.e., from 0.001 to 5 in 100 exponential steps), with the best λ selected automatically by cv.glmnet.
We use regularized logistic regression because it is highly interpretable: the model weights can be interpreted directly as a kind of importance score.
Availability of data and material.All methods are available through open-source software maintained by us.
ACKNOWLEDGMENTS
T.P.Q. thanks the authors of selbal for inspiring this work. T.P.Q. thanks Samuel C. Lee for his help with retrieving the TCGA data and the PAM50 labels. I.E. thanks Cedric Notredame for support. We both thank Michael Greenacre for clarifications regarding the notion of orthogonality in the context of balances.
We have no competing interests.
T.P.Q. implemented the procedures, performed the analyses, and drafted the manuscript. I.E. derived the differential proportionality metric, contributed code, and expanded the manuscript. Both authors conceptualized the thesis and approved the final manuscript.
FOOTNOTES
- Received April 10, 2019.
- Accepted March 5, 2020.
- Copyright © 2020 Quinn and Erb.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license.