q2-longitudinal: Longitudinal and Paired-Sample Analyses of Microbiome Data

Longitudinal sampling provides valuable information about temporal trends and subject/population heterogeneity. We describe q2-longitudinal, a software plugin for longitudinal analysis of microbiome data sets in QIIME 2. The availability of longitudinal statistics and visualizations in the QIIME 2 framework will make the analysis of longitudinal data more accessible to microbiome researchers.

T ime is an important component in many microbiome studies. Sampling microbial communities repeatedly over time provides information on their development (1), stability (2)(3)(4), or response to and recovery from a treatment or disturbance (5)(6)(7). The frequency and scale of longitudinal sampling can range from pre-post studies, in which individual subjects are sampled before and after treatment (8), to long-term observation studies lasting months or years. Such studies benefit from the use of dynamic analytical methods, which evaluate trends over time in relation to one or more variables, paired methods, which evaluate the magnitude of change within individual subjects, and random-effects models, which account for the variation inherent to complex biological systems (9).
To facilitate routine application of appropriate longitudinal methods in microbiome studies, we developed q2-longitudinal (https://github.com/qiime2/q2-longitudinal), a suite of bioinformatics tools for paired and longitudinal analyses. This software package is a plugin for the microbiome bioinformatics platform QIIME 2 (https://qiime2.org/) These charts display "control limits" 3 standard deviations above and below the mean and "warning limits" 2 standard deviations above and below the mean to identify observations that deviate substantially from the mean. Observations at these time points might indicate aberrant conditions, e.g., due to disturbance or even sample contamination. Spaghetti plots, illustrating the longitudinal trajectory of each individual, support visual assessment of individual subjects' stability, identifying aberrant individuals and time points. Volatility charts, which combine the attributes of control charts and spaghetti plots, can be generated by using q2-longitudinal's "volatility" visualizer ( Fig. 1). This produces an interactive HTML-based visualization, allowing users to select which metric is plotted on the y axis, select which categorical sample metadata column is used to group (aggregate) individual samples into group averages at each time point, adjust color scheme and other plot formatting characteristics, and toggle error bars, control/warning limits, and "spaghetti" lines. See https://github.com/ caporaso-lab/longitudinal-notebooks for a gallery of examples.
Feature volatility. A principal goal in longitudinal experiments is to determine how microbial communities (e.g., taxa, sequence variants, or other "features") change over time. This may be, for example, in response to a treatment or during different stages of host development. The most common rudimentary way to do this is to simply examine the average relative compositions of species over time, e.g., as a bar plot or heatmap. In doing so, we see the most dominant taxa and their succession over time. This approach, however, is effective only for identifying the most abundant organisms, and longitudinal averages smooth the data, ignoring whether these features are actually associated with specific time points across individual subjects. Other approaches, such as looking at the magnitude of change in abundance between time points, are similarly blind to the temporal dependence of these organisms.
We implement a different approach in q2-longitudinal's "feature-volatility" action, which uses machine learning regressors (random forests [13] by default) to learn the structure of the data and identify features (including low-abundance features) that are predictive of different states. Important features identified by these methods change over time, and their abundance is predictive of the specific time point when a sample is collected, indicating a temporal relationship. Importantly, feature importance does not imply statistical significance; this is intended as an exploratory method for identifying potentially relevant features for subsequent investigation, as illustrated below. The accuracy of the model itself can be assessed to determine how well these features differentiate time points. Only features from accurate models are likely to be temporally informative.
The feature-volatility action produces an interactive visualization of the longitudinal abundance, importance, and descriptive statistics for all important features (Fig. 2). The longitudinal abundance of each feature is plotted using volatility plots (as described above). In addition, feature importances, descriptive statistics, and other feature metadata are plotted as bar charts, guiding exploration of these features by comparing multiple feature characteristics. Users can select which features are plotted in the control charts either by using the "metric column" selection menu in the tool panel to the right of the plots or by clicking on the bar associated with that feature in the feature metadata bar plots. See https://github.com/caporaso-lab/longitudinal-notebooks for a gallery of examples.
Longitudinal analysis of early-life microbiome development. To demonstrate some of the methods currently available in q2-longitudinal, we present a reanalysis of data from the early childhood and the microbiome (ECAM) study (1). This study tracked the 16S rRNA gene microbiota compositions of 43 infants in the United States sampled at regular intervals from birth to 2 years of age and associations between antibiotic use, delivery mode, and predominant diet on microbiota composition and development. Here, we focus on novel methods implemented in q2-longitudinal, as well as other methods in q2-longitudinal, that draw new discoveries from the ECAM study data.
Identification of longitudinally volatile features. The original ECAM study focused primarily on particular taxa that were dominant in the childhood gut environment and were strongly associated with cesarean section (e.g., Bacteroides) and antibiotic use. Less attention was given to taxa associated with development of the childhood gut in general and to changes associated with early diet. To identify bacterial genera associated with early-life gut development and dietary modes, we applied q2-longitudinal's feature-volatility action to the ECAM data set, showing that only a few genera constitute the most important features and can accurately predict a subject's age (mean square error ϭ 11.78, R 2 ϭ 0.79) using a holdout set for model validation; the top 10 features comprise 75.3% of the total feature importance (from among 71 total important features used to train the final model) (Fig. 2). From among these, we focused on two genera to examine their abundance in relation to host age and diet: Bifidobacterium ( Fig. 2A) and Faecalibacterium (Fig. 2B). These genera were chosen based on their relatively high importance, mean abundance, and cumulative average change (Fig. 2C). Bifidobacterium exhibited a period of increase in relative frequency from 0 to 6 months of life in all subjects and then decreased from 6 to 18 months of life ( Fig. 2A). Faecalibacterium exhibited a stable, low average frequency during the first 6 months of life and then increased from 6 to 12 months in most subjects (Fig. 2B).
We used LME models via q2-longitudinal's "linear-mixed-effects" action to test whether the relative abundances of these genera were impacted by time (age) and subject characteristics (see Materials and Methods for more information on LME). We fit three separate linear models to examine Bifidobacterium abundance between 0 to 6 and 6 to 18 months of life and Faecalibacterium between 6 to 12 months of life, since the trajectories appear approximately linear within each of these biologically sensible developmental phases (Fig. 2). Month, delivery mode, diet (predominantly breastfed or formula fed for the first 3 months of life), and sex were used as fixed effects; random intercepts Dashed lines indicate the developmental "windows" that were separately analyzed by LME as described in the text. C, feature metadata and other descriptive statistics for the top important features, ordered by decreasing importance. Bifidobacterium and Faecalibacterium are labeled "Bif" and "Fae," respectively. This list was truncated and does not contain all 71 important genera identified in this analysis.
(subject identifier [ID]) and slopes (month of life) were applied as random effects. Bifidobacterium relative abundance at 6 months of life was significantly impacted by diet (P ϭ 0.009), indicating higher relative abundance in dominantly breastfed children (by a factor of 0.33) ( Table 1). A significant interaction was also observed between diet and delivery mode (P ϭ 0.009) on Bifidobacterium relative abundance at 6 months of life (Table 1). However, no factors other than age (P Ͻ 0.001) significantly impacted Bifidobacterium relative abundance during months 0 to 6 ( Table 2). Faecalibacterium relative abundance was significantly associated with the interaction of age and delivery mode (P ϭ 0.021), indicating that monthly growth in Faecalibacterium relative abundance was reduced by a factor of 0.014 in children delivered by cesarean section (Table 3). However, no other factor significantly impacted Faecalibacterium relative abundance, though diet exhibited a nearly significant effect at baseline (P ϭ 0.053) ( Table 3).
Tracking temporal changes in subjects' beta diversities. Next, we sought to explore how beta diversity changed across time within and between groups in the ECAM study. q2-longitudinal contains multiple methods for visualizing and transforming longitudinal data, allowing us to examine individual trajectories in detail. In particular, first differences and first distances (see Materials and Methods for more details) enable inspection of individuals' rates of incremental change between time points, an analysis not considered in the original ECAM study.
We applied first distances to examine how beta diversity (unweighted UniFrac distance [14] between successive samples collected from the same subject) changed over time in each subject in the ECAM study (Fig. 3). Results indicate that vaginally and cesarean section-delivered infants exhibit similar rates of phylogenetic transition (Fig. 3A). This was marked by a dramatic shift in the first month of life, followed by gradual stabilization in the rate of change but a very large degree of variance. These groups diverged only in the first month of life (when cesarean section-born infants exhibited a higher degree of change within individuals) and after 2 years of life (when sample sizes and statistical power were lower). Consistently with the close similarity  between delivery modes, an LME test indicates no significant differences between delivery modes, diet, or sex (data not shown). The "first-distances" method also has a "baseline" parameter for calculating distance from a static time point (Fig. 3B). This can be a useful approach for assessing how a subject differs from the start/end of a study or from another static time point (e.g., to highlight fluctuations in community structure/composition related to a treatment) (Fig. 3B). This method accentuates the differences between vaginally and cesarean section-delivered infants during the first few months of life: cesarean section-delivered children exhibit greater phylogenetic change from baseline than do vaginally delivered children (Fig. 3B). Nevertheless, LME models indicate no significant differences between delivery modes, diet, or sex during this time period (testing months 0 to 6) or across the entire study period (2 separate tests [data not shown]).
Quantifying shared features across time. The first-distances method also allows us to track longitudinal change in the proportion of features that are shared between an individual's samples. This can be performed by calculating pairwise Jaccard distance (the proportion of features that are not shared) between each pair of samples with QIIME 2's "diversity" plugin and by using the "first-distances" method to extract distances between successive samples or from baseline. Furthermore, the baseline parameter in the first distances and first differences methods also provides the ability to track longitudinal change from a separate set of (nonlongitudinal) samples that are  linked to those samples. The ability to compare longitudinal samples to a set of static reference samples supports many different types of questions pertinent to longitudinal microbiome experiments, e.g., comparing similarity between the microbiotas of human patients or gnotobiotic animals receiving fecal microbiota therapy and the compositions of donor samples (7), between fermentations and their inocula, or between intact and disturbed environments during recovery from disturbance. We used first distances to track the number of shared features (Jaccard distance) between the stool microbiotas of infants and the stool microbiotas of their mothers near the time of birth in the ECAM data set (Fig. 3C). Jaccard distance between sequence variant profiles indicates that very few variants were shared with a child's mother during the first year of life, but distance decreases into the second year of life, when a higher proportion of sequence variants were shared between mother-infant dyads (Fig. 3C). A LME test indicates a significant impact from diet (P ϭ 0.001) and month of life (P Ͻ 0.001) on baseline Jaccard distance (Table 4). These results indicate that infants are born with various levels of similarity to their mothers and that initial dietary inputs alter this; formula feeding reduces baseline dissimilarity. As infants age, they accumulate more microbiota characteristics of an adult gut ecosystem, causing their gut microbiota to more closely resemble that of their mothers.
Conclusions. Longitudinal designs for microbiome studies provide valuable information about temporal trends in biological activity. In addition, these designs allow investigators to distinguish between within-and between-subject variation, an important issue in characterizing heterogeneity in temporal patterns across experiments.
The q2-longitudinal plugin supports a variety of paired-sample and longitudinal tests relevant to studies of host-associated and environmental microbiomes. This includes methods for paired difference and distance testing, LME, microbial interdependence, analyses of volatility, and a variety of functions for generating interactive plots for data exploration and publication. Additional functions will be added to this plugin as they are developed (e.g., additional methods for quantifying longitudinal volatility and shared species counts), and we welcome collaboration from other developers who would like their methods accessible through q2-longitudinal (get in touch on the QIIME 2 Forum at https://forum.qiime2.org/). This plugin is included in QIIME 2, and installation instructions and tutorials can be accessed at https://qiime2.org or https://github.com/qiime2/q2-longitudinal.

MATERIALS AND METHODS
The q2-longitudinal package (https://github.com/qiime2/q2-longitudinal) is written in Python 3 and is accessible as a QIIME 2 plugin (https://qiime2.org). As a plugin for QIIME 2, users automatically have access to q2-longitudinal simply by installing QIIME 2 and can interact with the plugin using a variety of user interfaces (command line, Python API, and graphical user interfaces are included in the Core Distribution). The actions in this plugin utilize SciPy (https://scipy.org), NumPy (15), and pandas (16) for data manipulation and statistical testing, q2-sample-classifier (17) for supervised regression, Vega (18) for interactive HTML-based visualizations, and Matplotlib (19) and seaborn (https://zenodo.org/record/ 12710) for static plots. Tutorials and other information about the q2-longitudinal plugin are available at We make extensive use of LME in this work and so will describe this method and some data transformations in more detail here.
Feature-volatility action. The feature-volatility action uses a supervised learning regressor to predict a continuous variable (e.g., age or time) as a function of feature composition (e.g., taxonomic composition). q2-longitudinal wraps q2-sample-classifier (17) to access multiple different scikit-learn (20) supervised learning regressors (random forests [13] by default).
1. Samples are randomly split into training and testing sets (4:1 ratio). 2. The training set is used to train a user-selected regressor.
a. If "parameter-tuning" is true, cross-validated hyperparameter autotuning will be performed on the training set to select samples that optimize predictive accuracy. See q2-sample-classifier (17) for more details. If "parameter-tuning" is false, default parameters will be used; see scikit-learn (20) for details on default parameters for each regressor. b. Cross-validated recursive feature elimination is performed on the training set to select features that optimize predictive accuracy and eliminate noisy features. See q2-sample-classifier (17)  Linear mixed effects. LME models examine the relationship between one or more independent variables (effects) and a single longitudinal response, where observations are made across dependent samples, e.g., in repeated-measures experiments. For example, a simple LME model may include an intercept and slope term as both fixed and random effects. The fixed intercept and slope can be interpreted as the regression line for the average subject, while the random effects reflect individual departures from the average line for each subject. This linear model can be written as y ij ϭ X= ij ␤ ϩ Z ij b i ϩ ij , where y ij is the jth measurement on the ith participant; X ij is a P ϫ 1 vector of fixed-effect covariates that may include, e.g., time, group, gender, and their interactions; ␤ is a P ϫ 1 vector of fixed-effect regression coefficients; Z ij is a q ϫ 1 vector of fixed effects that typically includes a polynomial function of time; b i is a q ϫ 1 vector of random effects that reflect individual departures from the overall population average effects (b i is assumed to be multivariate normal, with a mean vector o and variance ⌺, where ⌺ is a q by q matrix reflecting the covariances between the random effects); and ij are normally distributed error terms that have mean o and variance 2 and are assumed independent across individuals and time.
An attractive feature of the model is that it allows the investigator to explicitly model heterogeneity in the initial value and slope across subjects. This is important for longitudinal microbiome studies where we expect heterogeneity in the temporal pattern across individuals. Fixed effects are factors that may reflect group or time and assess the overall effect of the factor on the response. Random effects reflect variation in these effects across subjects. LME models are available in the linear-mixed-effects action in q2-longitudinal. All LME models described in this work for analysis of the ECAM data set used month of life, diet, delivery mode, and sex as fixed effects and random intercepts (subject ID) and slopes (month of life) as random effects. Initially, LME models were fit with all interactions between the main effects as fixed effects, and then insignificant terms were removed from the model to focus on main effects and significant interactions in the final model. Tables 1 to 4 list all factors used as fixed effects in the final models.
The linear-mixed-effects action in q2-longitudinal uses statsmodels' "mixedlm" function to compute LME models (21). This function computes P values based on each variable's Z-score estimate with respect to the standard normal distribution (two-tailed, alpha ϭ 0.05).
First differences/distances. Another way to view time series data is by assessing how the rate of change differs over time. We can do this through calculating first differences, which represent the magnitude of change between successive time points for a given metric. If Y t is the value of single metric Y at time t, the first difference at time t is ΔY t ϭ Y t -Y t -1 . This calculation is performed at fixed intervals, so for each interval, ΔY t is not calculated for subjects that are missing samples at time t or t -1.
This transformation is performed in the "first-differences" method in q2-longitudinal. A similar method implemented in q2-longitudinal is first-distances, which instead identifies the beta diversity (between-sample) distances between successive samples from the same subject based on a distance matrix. The output of first-distances is particularly empowering, because it allows us to analyze longitudinal changes in beta diversity using actions that cannot operate directly on a distance matrix, such as linear mixed-effect models, or plotting with volatility charts.
Difference/distance from baseline. The first-differences and first-distances methods have an optional baseline parameter to instead calculate differences/distances from a static point (e.g., baseline or a time point when a treatment is administered): ΔY t ϭ Y t Ϫ Y baseline .