## ABSTRACT

Exploratory analysis of human microbiome data is often based on dimension-reduced graphical displays derived from similarities based on non-Euclidean distances, such as UniFrac or Bray-Curtis. However, a display of this type, often referred to as the principal-coordinate analysis (PCoA) plot, does not reveal which taxa are related to the observed clustering because the configuration of samples is not based on a coordinate system in which both the samples and variables can be represented. The reason is that the PCoA plot is based on the eigen-decomposition of a similarity matrix and not the singular value decomposition (SVD) of the sample-by-abundance matrix. We propose a novel biplot that is based on an extension of the SVD, called the generalized matrix decomposition biplot (GMD-biplot), which involves an arbitrary matrix of similarities and the original matrix of variable measures, such as taxon abundances. As in a traditional biplot, points represent the samples, and arrows represent the variables. The proposed GMD-biplot is illustrated by analyzing multiple real and simulated data sets which demonstrate that the GMD-biplot provides improved clustering capability and a more meaningful relationship between the arrows and points.

**IMPORTANCE** Biplots that simultaneously display the sample clustering and the important taxa have gained popularity in the exploratory analysis of human microbiome data. Traditional biplots, assuming Euclidean distances between samples, are not appropriate for microbiome data, when non-Euclidean distances are used to characterize dissimilarities among microbial communities. Thus, incorporating information from non-Euclidean distances into a biplot becomes useful for graphical displays of microbiome data. The proposed GMD-biplot accounts for any arbitrary non-Euclidean distances and provides a robust and computationally efficient approach for graphical visualization of microbiome data. In addition, the proposed GMD-biplot displays both the samples and taxa with respect to the same coordinate system, which further allows the configuration of future samples.

## INTRODUCTION

A biplot simultaneously displays, in two dimensions, rows (samples) of a data matrix as points and columns (variables) as arrows. Based on a matrix decomposition of the data matrix, the biplot is a useful graphical tool for visualizing the structure of large data matrices. It displays a dimension-reduced configuration of samples, as in a principal-coordinate analysis plot, and the variables with respect to the same set of coordinates. If meaningful sample groupings are observed, this allows visualizing which variables contribute most to the separation. The traditional biplot, as first introduced in reference 1, displays the first two left and right singular vectors of the singular value decomposition (SVD) of the data matrix as points and arrows, respectively. This biplot, which we hereafter refer to as the SVD-biplot, uses the SVD to find the optimal least-square representation of the data matrix in a low-dimensional space. The SVD-biplot can show Euclidean distances between samples and also display approximate variances and correlations of the variables. It also has the appealing property that the singular values obtained from the SVD are nonincreasing, indicating that the decomposition of the total variance of the data matrix into each dimension is nonincreasing.

In many scenarios, the Euclidean distance may not be optimal for characterizing dissimilarities between samples. An important example arises in the analysis of microbiome data, in which marker gene sequences (e.g., 16S rRNA) are often grouped into taxonomic categories using bioinformatic pipelines such as QIIME (2) or mothur (3). These taxon counts can be summarized into a data matrix with rows and columns representing samples and taxon abundances, respectively. A variety of non-Euclidean distance measures, including nonlinear measures, are then used to quantify the similarity between samples. One common measure of dissimilarity is the UniFrac distance (weighted or unweighted), which is a function of the phylogenetic dissimilarity of a pair of samples (4, 5). Other nonphylogenetic, non-Euclidean dissimilarities include Jaccard or Bray-Curtis distances (see, e.g., reference 6 and the references therein). Plotting the samples in the space of the first few principal components (PCs) of the similarity matrix obtained from such non-Euclidean distance matrices—often referred to as principal-coordinate analysis (PCoA)—may reveal an informative separation between samples. However, the configuration of samples yielded by PCoA keeps only pairwise distances between samples and lacks a coordinate system that relates to the taxa that constitute each sample. Hence, it does not shed any light on which taxa may play a role in this separation. One approach for addressing this problem is to simply plot an arrow for each taxon based on its correlation with the first two PCs of the non-Euclidean similarity matrix (7). However, in such a “joint plot” (8), the direction and length of an arrow does not represent the taxon’s true contribution to the dissimilarity between samples. In addition, due to the lack of a coordinate system, one cannot add sample points for future observations into this “joint plot.”

Three main approaches have been recently proposed to extend the SVD-biplot to more general distances defined on the samples. The R package “ade4” (9) provides a biplot that can handle weighted Euclidean distances but it cannot handle non-Euclidean distances. The second approach, proposed by Greenacre (10), aims to approximate the non-Euclidean distance by a weighted Euclidean distance. Weights are estimated for variables, and the biplot can subsequently be constructed using weighted least-square approximation of the matrix. This approach has a straightforward interpretation. However, the estimated weighted Euclidean distance may not capture all the information from the original non-Euclidean distance. A recent proposal in reference 11 appears to be the first to address the lack of mathematical duality between the samples’ locations (points) and the variables’ contribution (arrows) to those locations. This approach seeks an approximate SVD-like decomposition of the data matrix, which directly takes the non-Euclidean distance into consideration. This SVD-like decomposition has the following two advantages. First, the left singular vectors are the eigenvectors of the similarity measure derived from the non-Euclidean distance, which preserve the role of the non-Euclidean distance in classifying the samples. Second, an approximate matrix duality (AMD) between the left and right singular vectors is restored, which simply means that each set of vectors can be immediately obtained from the other. To emphasize this connection, we hereafter refer to this decomposition as the AMD. Unfortunately, the AMD also suffers from two drawbacks. First, the AMD is only an approximate decomposition of the data matrix, and hence may not capture all the variation of the original data. In particular, the configuration of samples displayed in an AMD-biplot is independent of the data matrix, since the left singular vectors of the AMD depend only on the non-Euclidean distances. Ignoring the data matrix for classifying samples seems nonintuitive since the data matrix is typically assumed to contain some information on the sample similarities. Second, the AMD may result in nondecreasing “singular values.” While these seem like minor technical issues, the second drawback can have important practical implications: which of the left and right singular vectors should be displayed in the resulting biplot? The authors of reference 11 suggest constructing the AMD-biplot based on the two left and right singular vectors that correspond to the two largest singular values. This AMD-biplot assures that the arrows for variables are as meaningful as possible, but they may fail to reveal meaningful sample clusters if the information of sample clusters is associated only with the first several left singular vectors. An alternative approach may be to simply display the first and second left and right singular vectors of the AMD (as done for the SVD). Unfortunately, this strategy does not solve the problem either: although we may observe meaningful sample clusters, the arrows may not be meaningful due to the small singular values. There is thus a lack of clarity regarding which singular vectors should be used to construct the AMD-biplot.

The drawbacks of the AMD-biplot motivate our proposal which is based on the generalized matrix decomposition biplot (GMD-biplot) (12). The GMD-biplot is a direct generalization of the SVD-biplot that accounts for structural dependencies among the samples and/or variables. This approach has several advantages. First, as with the AMD, it directly handles any non-Euclidean distance matrix. Specifically, the full information from that distance matrix is used. Second, unlike the AMD, which provides an approximate decomposition of the data matrix, the GMD provides an exact decomposition of the original data matrix without losing any information. Third, the GMD restores the matrix duality in a mathematically rigorous manner, unlike the approximate matrix duality obtained with the AMD; it naturally extends the duality inherent in the SVD and allows one to plot both the configuration of samples and the contribution of individual variables with respect to a new coordinate system. Fourth, the GMD gives nonincreasing GMD values, so the resulting GMD-biplot can be directly constructed based on the first two left and right GMD vectors. Last, unlike the AMD-biplot whose sample clusters depend only on distance, the GMD-biplot uses both the non-Euclidean distance and the data matrix for classifying samples, which more directly connects the contribution of the individual variables to the configuration of samples. Additionally, besides accounting for the non-Euclidean distances between samples, the GMD can also account for auxiliary information on (dis)similarities between the variables.

The remainder of this paper is organized as follows. We first illustrate the GMD-, AMD-, and SVD-biplots in three numerical studies. We then discuss advantages of the proposed GMD-biplot and further extensions. In Materials and Methods, we present detailed description of the GMD-biplot framework.

## RESULTS

In the results below, we compare the GMD-, AMD-, and SVD-biplots on three data sets in the manner that each has been proposed recently for microbiome data. In particular, in reference 11, the AMD-biplot is advocated specifically for relative abundance data, while in reference 13, the SVD-biplot is advocated for data that have been scaled by the centered log ratio (CLR) transformation. The GMD-biplot is constructed using the CLR-transformed data. We first examine the performance of all biplots using the smokeless tobacco data set explored in reference 11. In the second study, we compare their performances using the human gut microbiome data from reference 14. In the third analysis, we simulate a data set based on the smokeless tobacco data to illustrate a dilemma that the AMD-biplot may face.

Analysis of the smokeless tobacco data.This data set includes 15 smokeless tobacco products: 6 dry snuff samples, 7 moist snuff samples, and 2 toombak samples from Sudan. Three separate (replicate) observations (starting with sample preparation) were made of each product, so that a total of 45 observations are available. Each observation has a 271 × 1 vector of taxon counts, and thus, the data set can be formed into a 45 × 271 matrix, denoted by * _{n}* is an

*n*× 1 vector. Since

For the GMD-biplot, we consider the CLR transformation of

We constructed the GMD-biplot and the AMD-biplot based on *Alloiococcus* and *Halophilus*, while *Aerococcaceae* appears elevated in toomback samples.

Figure 1e, which is the same as the right bottom panel of Fig. 1 in reference 11, shows that the AMD singular values are not necessarily decreasing. It should be noted that Fig. 1b is slightly different from Fig. 3 in reference 11; this difference may be due to the use of

Additionally, we included the SVD-biplot and its corresponding scree plot in Fig. 1c and f, respectively. As the SVD-biplot assumes Euclidean distances between samples, it is more appropriate to construct the SVD-biplot using the CLR-transformed data

It is worth noting that the three biplots identify different top taxa, i.e., the taxa with the longest arrows. Although a biplot is not a rigorous statistical method to detect important taxa, it may shed light on which taxa are important to the observed sample clustering. To see this, we performed a univariate linear regression of each taxon (each column of *P* values representing the strength of association between each taxon and the tobacco groups. We then sorted these *P* values in a nondecreasing order and obtained the rank of each taxon based on the sorted *P* values. Hence, it is desirable that the taxa with the lowest ranks can be identified by the biplots. Table S1 in the supplemental material summarizes the ranks of the top 10 taxa identified by each biplot. It can be seen that the top 10 taxa identified by the GMD-biplot have lower ranks on average than those identified by the AMD and SVD biplots, indicating that the GMD-biplot may identify more meaningful taxa with respect to the separation of the samples than the AMD and SVD biplots.

### TABLE S1

Copyright © 2019 Wang et al.This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.

Analysis of human gut microbiome data.We consider the human gut microbiome data collected in a study of healthy children and adults from the Amazonas of Venezuela, rural Malawi, and U.S. metropolitan areas (14). The original data set

As concluded in reference 14, shared features of the functional maturation of the gut microbiome are identified during the first 3 years of life. We thus define a binary outcome *h _{i}* based on the age of the individual (in years) when each sample was taken as:

In all biplots, the *i*th sample is colored by age_{i} and symbolized by *h _{i}*. Figure 2d indicates that more than 80% of the total variance is explained by the GMD-biplot in Fig. 2a, which provides a good visualization of sample clusters across age. By examining the relationship between the arrows and the color of the sample points in Fig. 2a, we see that

*Prevotella*may be elevated in adults, while

*Parabacteroides*appears to be elevated in infants. In contrast, Fig. 2e shows that less than 15% of the total variance is explained by the AMD-biplot in Fig. 2b and the AMD values are not decreasing. As shown in Fig. 2b, the AMD-biplot also displays potential clusters across age, but the sample points are not as tightly clustered as those in Fig. 2a.

*Odoribacter*appears to be elevated in adults in Fig. 2b, while

*Lactobacillus*appears associated with infants. As a reference, Fig. 2c shows the SVD-biplot of

To further quantify the classification accuracy, for each biplot, we predicted the probability that each sample belongs to group 1 based on leave-one-out cross validation using the binary logistic regression of the group index *h _{i}* on the two selected components. We then plotted a receiver operating characteristic (ROC) curve for each biplot based on the predicted probabilities (see Fig. S1 in the supplemental material) and calculated the area under the ROC curve (AUC): the GMD-, AMD-, and SVD-biplots yield an AUC of 0.989, 0.976, and 0.990, respectively. The AUC results indicate that the GMD-biplot provides a better separation of age groups than the AMD-biplot, but there is not a clear difference between the GMD-biplot and the SVD-biplot. This may be because, for the CLR-transformed data

### FIG S1

Copyright © 2019 Wang et al.This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.

We emphasize that both the GMD-biplot and the SVD-biplot identify *Prevotella* and *Parabacteroides* as top taxa, while the AMD-biplot identifies completely different ones. As reference 14 confirms that the trade-off between *Prevotella* and *Bacteroides* (including *Parabacteroides*) considerably drives the variation of microbiome abundance in adults and children between 0.6 and 1 year of age in all studied populations, the GMD- and SVD-biplots may thus identify more biologically meaningful taxa than the AMD-biplot. It should, however, be noted that these bacterial are “identified” based on circumstantial, not statistical, evidence, and more work is needed to examine statistical associations in this context.

Incorporating a kernel for variables into the GMD-biplot.The GMD problem (see equation 3 in Materials and Methods) allows not only the similarity kernel for samples but also a kernel for the variables. Including both kernels may further improve the accuracy of sample classification as well as the identification of important variables. We illustrate this advantage by incorporating a kernel for variables in the analysis of the human gut microbiome data. More specifically, we first calculate a matrix *Prevotella* but not *Parabacteroides*, which may explain the lower clustering accuracy.

Including an additional kernel for variables in the GMD-biplot is related to the method of double-principal-coordinate analysis (DPCoA) (16). DPCoA, as shown in reference 17, is equivalent to a generalized PCoA which essentially incorporates an additional similarity kernel for variables into the analysis, as described in Proposition 1, but for

Simulation.In this section, we conduct a simulation study based on the smokeless tobacco data to illustrate a scenario in which the AMD-biplot may fail to separate the samples, whereas the GMD-biplot performs well. Let *w _{i}* that indicates the group index of the

*i*th sample as:

The GMD-biplot and the AMD-biplot of

To see why this occurs, we summarize the first three diagonal elements of *w _{i}* depends only on the first column of

Equation 2 implies that the diagonal elements of

## DISCUSSION

Biplots have gained popularity in the exploratory analysis of high-dimensional microbiome data. The traditional SVD-biplot is based on Euclidean distances between samples and cannot be directly applied when more general dissimilarities are used. Since Euclidean distances may not lead to an optimal low-dimensional representation of the samples, we have extended the concept of the SVD-biplot to allow for more general similarity kernels. The phylogenetically informed UniFrac distance, used in our examples, defines one such kernel. In settings where a general (possibly nonlinear) distance matrix is appropriate, our approach provides a mathematically rigorous and computationally efficient method, based on the GMD, that allows for plotting both the samples and variables with respect to the same coordinate system.

Our first data example with the smokeless tobacco data set from reference 11 demonstrates the merits of the proposed GMD-biplot. We found that the GMD-biplot successfully displays different types of products, while the AMD-biplot is not able to completely separate dry and moist snuff samples, and the SVD-biplot fails to separate moist and toombak samples. As shown in Table S1 in the supplemental material, the GMD-biplot is also able to identify biologically more meaningful taxa that are related to the different types of products, compared to the AMD-biplot and the SVD-biplot.

In our second example, the GMD-biplot also outperforms the AMD-biplot in terms of both the sample clustering and the identification of important taxa. However, there is not a clear advantage of the GMD-biplot over the SVD-biplot in this example. This difference between the two examples may be attributed to the relation between the Euclidean kernel and the non-Euclidean similarity measure. Denoting the Euclidean kernel and the non-Euclidean similarity measure by

In practice, we typically do not know what the true configuration of samples looks like, so it is impossible to determine whether

Our discussion in this paper has focused on the form biplot, which aims to visualize the relationship between variables and the sample clustering. In other scenarios, where the variation of the data matrix explained by each variable is of particular interest, the covariance biplot may be more appropriate. This biplot considers the GMD of *j*th variable. Note that when *q *= 2, the length of the arrow of the *j*th variable in the covariance biplot is given by

## MATERIALS AND METHODS

We denote the data matrix by *n* is the number of samples and *p* is the number of variables (taxa). We assume that the columns of *i*th row (sample) and its (*i*, *j*) entry by **m*** _{i}* and

*m*, respectively. We denote the transpose of

_{ij}Biplot, distance measure, and AMD.A biplot is a graphical method to simultaneously represent, in two dimensions, both the rows (as points) and columns (as arrows) of the matrix *K* identity matrix. Based on the SVD,

The SVD-biplot displays the first two columns of *i*th and *j*th sample as *n *× 1 vector of ones. It can then be seen that *i*, *j*) entry of

Now, if

Here,

GMD and the GMD-biplot.The concept of generalized matrix decomposition (GMD) was introduced by Escoufier (20) and further developed in reference 12. It is a generalization of the SVD with additional structural dependencies taken into consideration. We briefly review the key ideas behind the GMD. Let *q* approximation to

Proposition 1: The GMD solutions

Note that the GMD can handle the non-Euclidean similarity kernel **v*** _{j}* be the

*j*th column of

*i*th sample point can be configured by the coordinates of

**x**

*, given by*

_{i}*j*th variable, we consider the vector

*j*th element and 0’s elsewhere. Then, the arrow for the

*j*th variable can be configured by the coordinates of

**e**

*, given by*

_{j}*k*th diagonal element of

*k*= 1, 2.

Since the GMD values are nonincreasing, for the purpose of constructing the GMD-biplot, we can choose *q *= 2 in the GMD problem (equation 3), which may save considerable computational time. In contrast, since the AMD may produce nondecreasing “singular values,” we have to find the full decomposition of *n* and *p*.

Data availability.All data used are publicly available in references 11 and 14. All computations are conducted in the R programming language, and the proposed biplot is implemented in our R package “GMDecomp,” available at https://github.com/taryue/GMDecomp.

## ACKNOWLEDGMENTS

This work was partially supported by grant R01 GM114029 from the NIH. A.S. also acknowledges support from the NSF through grant DMS-1561814. J.M. and T.W.R. acknowledge support by grant R01 GM129512 from the NIH. T.W.R. also acknowledges support from NIH grants R01 CA192222 and P01 CA168530.

The content of this article is solely our responsibility and does not necessarily represent the official views of the NIH or any other funding agency.

We thank Parker Knight for his assistance with the software development.

## FOOTNOTES

- Received August 19, 2019.
- Accepted November 13, 2019.

- Copyright © 2019 Wang et al.

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license.