Correcting for Microbial Blooms in Fecal Samples during Room-Temperature Shipping

In many microbiome studies, the necessity to store samples at room temperature (e.g., remote fieldwork) and the ability to ship samples without hazardous materials that require special handling training, such as ethanol (e.g., citizen science efforts), are paramount. However, although room-temperature storage for a few days has been shown not to obscure physiologically relevant microbiome differences between comparison groups, there are still changes in specific bacterial taxa, notably in members of the class Gammaproteobacteria, that can make microbiome profiles difficult to interpret. Here we identify the most problematic taxa and show that removing sequences from just a few fast-growing taxa is sufficient to correct microbiome profiles.

KEYWORDS 16S rRNA, DNA sequencing, bioinformatics

The use of sterile swabs is a convenient way to collect samples for microbiome studies, but in some cases, it is not feasible to freeze samples immediately or to use a preservative. For example, the American Gut Project (AGP; Qiita study identifier [ID] 10317) allows members of the general public to send samples for 16S rRNA gene amplicon sequencing through domestic post without a preservative. This is because proven preservation methods can be cumbersome, dangerous, expensive, or sample type specific, complicating participation in microbiome citizen science. Although some studies have demonstrated that the effects of room-temperature storage are secondary to physiologically relevant differences between comparison groups (1-3), certain bacterial taxa, particularly those in the class Gammaproteobacteria, grow well at room temperature. This is problematic, as some Gammaproteobacteria species have been associated with disease, such as inflammatory bowel disease (IBD) (4).
Therefore, to identify meaningful patterns in microbiome studies that do not utilize sample preservation, it is crucial to remove at high specificity the taxa that thrive at room temperature (i.e., "blooming" bacteria).
Here we performed a meta-analysis that combined fecal samples from storage experiments (low sample numbers but easily interpretable results) with bulk sample statistics from projects comparing room-temperature shipping to immediate freezing, identifying exact sequences corresponding to blooms by applying Deblur (5) to the data sets. We assessed whether any sequences are enriched more than expected in room temperature samples, producing a list of candidate sub-operational taxonomic units (sOTUs) or exact sequences that appear to increase in frequency at room temperature. We then filtered these exact sequences from the AGP data set, restoring a biological association that was obscured by the blooms. We further validate the procedure by confirming that filtered data sets more closely resemble those from immediately frozen samples and by showing that the overall microbiome profiles better match the results of other published human microbiome studies.
To identify the candidate blooming bacteria, we first examined the effect of room-temperature storage on fecal microbiome samples. Using two recent storage studies (1, 2), we showed that taxonomic abundance changes over time in nonfrozen fecal samples compared to frozen samples are mainly due to a small number of taxa (Fig. 1A to D). The taxa that contributed disproportionately are primarily members of the class Gammaproteobacteria, which is unsurprising given that many members of this class are easily cultivable and fast growing and are commonly isolated from human stool. Unfortunately, these storage studies examined samples from a small number of individuals, and therefore it is possible that additional bacterial taxa bloom in samples shipped via domestic post that by chance were not present in these controlled studies. To address this limitation, we compared all AGP fecal samples (~7,000 samples) to data from three studies comprising fecal samples frozen immediately after collection (fresh-frozen) (6, 7; Personal Genome Project [PGP; unpublished data, Qiita study ID 1189]). Importantly, because each study represented a different population, it is likely that sOTUs were present at different frequencies across these studies. Nevertheless, blooming bacteria are expected to be at a higher frequency in AGP samples than in all of the fresh-frozen samples.
Using reasonable thresholds for relative abundance changes in the storage studies and in the AGP compared to fresh-frozen studies, we identified 20 bacterial sOTUs as candidates for blooming during shipping (Fig. 1E; see Table S1 in the supplemental material) using the following criteria: a fold increase of 2 or more in the room-temperature storage studies (1, 2) and AGP relative to fresh-frozen fecal samples from studies (6, 7; PGP) and a fold increase of 50 or more within the storage studies only or not observed in the storage studies but with at least a 2-fold change in AGP compared to the fresh-frozen studies. The results appear insensitive to these specific thresholds, as we found that removal of a subset of 10 of the identified candidate blooms from the AGP cohort was sufficient to restore a well-characterized age correlation with alpha diversity (Fig. 2E and F) and was sufficient for a significant decrease in the distances to fresh-frozen samples (see Fig. S1 in the supplemental material).
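The selection criteria above can be sketched in a few lines of Python. This is only an illustration of one possible reading of the thresholds, not the procedure used in the study: the function names, the pseudocount, and the choice to compare the AGP frequency against the highest fresh-frozen frequency (so that an sOTU must exceed every fresh-frozen study) are all assumptions made for the example.

```python
def fold_change(numerator, denominator, pseudocount=1e-6):
    """Ratio of two mean relative abundances; the pseudocount (an assumption)
    avoids division by zero for sOTUs absent from the baseline."""
    return (numerator + pseudocount) / (denominator + pseudocount)


def is_candidate_bloom(agp_freq, frozen_freqs, storage_fold=None,
                       min_fold=2.0, strong_storage_fold=50.0):
    """Hypothetical reading of the criteria.

    agp_freq     -- mean relative abundance of the sOTU in shipped (AGP) samples
    frozen_freqs -- mean relative abundance in each fresh-frozen study
    storage_fold -- fold increase in the storage studies, or None if the
                    sOTU was not observed there
    """
    # Compare against the highest fresh-frozen frequency so that the sOTU
    # has to be enriched relative to every fresh-frozen study.
    agp_fold = fold_change(agp_freq, max(frozen_freqs))

    if storage_fold is None:
        # Not observed in the storage studies: rely on AGP enrichment alone.
        return agp_fold >= min_fold
    if storage_fold >= strong_storage_fold:
        # Very strong growth in the storage studies alone.
        return True
    # Otherwise require enrichment in both the storage studies and AGP.
    return storage_fold >= min_fold and agp_fold >= min_fold
```

Under this reading, an sOTU with a 3-fold increase in the storage studies is kept only if it is also at least 2-fold enriched in the shipped samples relative to each fresh-frozen study.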
To mitigate the effect of these blooming bacteria on subsequent microbiome analyses, we removed exact sequence matches to identified blooms from 35,146 unique sOTUs identified by applying Deblur (5) to 10,189 samples spanning 338,496,967 sequences from the AGP data set. Each of the 20 blooms had an exact match to one of the unique Deblur sOTUs, and a total of 32,696,826 reads were removed (per-sample dropped sequences spanning 0.4%, 13.1%, and 45.3% for the 25th, 50th, and 75th percentiles, respectively). Importantly, some of the removed sequences were likely "real"; for example, Escherichia coli and Citrobacter sequences present in the candidate blooming list were present at nonnegligible frequencies in fresh-frozen samples. However, these sequences were included for removal as their tendency to grow during shipment can greatly impact the relative abundances of other organisms due to the compositional nature of the data.
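As a minimal sketch of the exact-match removal described above, the following Python function drops bloom sequences from one sample's sOTU counts and reports the fraction of reads removed. The dictionary layout (sequence string to read count) and the function name are assumptions for illustration; the sequences in the usage line are placeholders, not real bloom sequences.

```python
def remove_blooms(sample_counts, bloom_seqs):
    """Remove exact sequence matches to blooms from one sample.

    sample_counts -- dict mapping sOTU sequence -> read count (assumed layout)
    bloom_seqs    -- iterable of bloom sequences to filter out

    Returns the filtered counts and the fraction of reads that was dropped.
    """
    blooms = set(bloom_seqs)
    kept = {seq: n for seq, n in sample_counts.items() if seq not in blooms}
    total = sum(sample_counts.values())
    removed = total - sum(kept.values())
    fraction_removed = removed / total if total else 0.0
    return kept, fraction_removed


# Placeholder sequences, for illustration only.
filtered, dropped = remove_blooms({"ACGT": 30, "TTGA": 60, "GGCC": 10}, ["TTGA"])
```

Collecting the per-sample fraction across a cohort gives the kind of percentile summary quoted above, and relative abundances should be recomputed from the filtered counts because the data are compositional.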
Without filtering candidate blooms, there were notable differences (as observed using Bray-Curtis principal-coordinate analysis [PCoA]) between AGP fecal samples and the fresh-frozen fecal samples; filtering the bloom sequences from all samples removed these differences (Fig. 2A versus B). In the PCoA space corresponding to the data determined without filtering, the primary separation is explained by the presence of a large percentage of bloom sequences (Fig. 2A); the sizes of the spheres are scaled by the percentage of bloom sequences in the respective sample. Following the removal of the blooms, this dominant effect was abolished and samples with high levels of blooms clustered with samples from the other studies (Fig. 2B). Similar results were observed in assessing class-level taxonomy abundances (Fig. 2C versus D): prior to filtering, a high relative abundance of Gammaproteobacteria (27%) was present in the AGP samples compared to the fresh-frozen samples (1.5% to 3.5%), while the AGP profile seen after filtering more closely resembled that of the fresh-frozen samples. Importantly, applying the filter minimally changed the taxonomic profiles of fresh-frozen samples (Fig. 2D). The filtering procedure is available in a Jupyter Notebook (8) at https://github.com/knightlab-analyses/bloom-analyses.
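To illustrate why a single bloom can dominate Bray-Curtis distances, the toy example below compares a hypothetical shipped sample with a fresh-frozen one before and after the bloom is removed from both. The taxon labels, counts, and helper names are invented for illustration, and the distance is computed on relative abundances, which is an assumption about preprocessing rather than a description of the published analysis.

```python
def relative_abundance(counts):
    """Convert raw counts to proportions that sum to 1."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance profiles (dicts)."""
    keys = set(a) | set(b)
    num = sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys)
    den = sum(a.get(k, 0.0) + b.get(k, 0.0) for k in keys)
    return num / den if den else 0.0


def drop_features(counts, to_drop):
    """Return counts without the listed features (applied to every sample)."""
    return {k: v for k, v in counts.items() if k not in set(to_drop)}


# Invented toy profiles: the shipped sample is the frozen one plus a bloom.
frozen = {"taxon_A": 50, "taxon_B": 50}
shipped = {"taxon_A": 25, "taxon_B": 25, "bloom": 50}

before = bray_curtis(relative_abundance(shipped), relative_abundance(frozen))
after = bray_curtis(
    relative_abundance(drop_features(shipped, ["bloom"])),
    relative_abundance(drop_features(frozen, ["bloom"])),
)
```

In this toy case the underlying community is identical in both samples, so the distance is driven entirely by the bloom and vanishes once the bloom is filtered from both profiles.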
There is a balance between type 1 and type 2 errors that must be considered in applying this filter. The cost of removing a sequence is that it becomes "invisible" in the analysis, and it is possible that real sequences are lost. Conversely, retaining a bloom sequence increases noise caused by shipment conditions, which can artificially alter biological conclusions. Therefore, a balance must be struck between loss of data and inaccurate, noisy data. To select an appropriate number of blooming bacterial sequences to subtract from the AGP data set, maximizing the amount of data retained while reducing inaccuracies caused by blooms, we tested the effect of nested filtering levels on the ability to detect the well-known effect of age on alpha diversity (9, 10). As can be seen in Fig. 2E and F, this effect was not detected by a Kruskal-Wallis test when none of the candidate blooms were removed. However, filtering the top four candidate blooms restored the ability to detect a significant difference in diversity by age. Critically, the identification of the bloom sOTUs was done independently of this positive control. For analysis of the AGP cohort, we recommend removal of the sequences of the top 10 candidate blooming bacterial taxa, as this maximizes the expected age effect (Fig. 2E). Different studies may want to remove a different subset of bloom sequences, as retaining some of these sequences might be critical, depending on the study characteristics. In a meta-analysis, the filter must be applied identically to all samples to avoid introducing a systematic bias.
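The nested filtering test can be sketched as follows. This is a simplified illustration, not the published analysis: Shannon diversity is assumed as the alpha diversity metric, the Kruskal-Wallis H statistic is computed without tie correction or a P value, and the data layout and function names are hypothetical.

```python
import math


def shannon(counts):
    """Shannon diversity (natural log) of a list of read counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def kruskal_h(groups):
    """Kruskal-Wallis H statistic (no tie correction, no P value)."""
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        i = j
    spread = sum(r * r / len(g) for r, g in zip(rank_sums, groups))
    return 12.0 / (n * (n + 1)) * spread - 3 * (n + 1)


def nested_filter_scan(samples, age_group, ranked_blooms, levels):
    """For each k in levels, remove the top-k ranked blooms from every sample
    and return {k: H}, where H compares Shannon diversity across age groups.

    samples       -- dict: sample ID -> {sOTU: count}   (assumed layout)
    age_group     -- dict: sample ID -> age group label
    ranked_blooms -- candidate blooms ordered from strongest to weakest
    """
    results = {}
    for k in levels:
        drop = set(ranked_blooms[:k])
        by_group = {}
        for sid, counts in samples.items():
            kept = [n for seq, n in counts.items() if seq not in drop]
            by_group.setdefault(age_group[sid], []).append(shannon(kept))
        results[k] = kruskal_h(list(by_group.values()))
    return results
```

Scanning `levels` such as `range(0, 21)` and comparing the resulting statistics mirrors the idea of choosing the smallest set of removed blooms that recovers the expected age effect; a full analysis would also compute P values from the chi-square distribution, for example with a statistics library.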
Given that most bacteria change relatively little in relative abundance during room-temperature storage, filtering for blooms removes an important confounding variable and facilitates meta-analysis of projects that have used different storage procedures. We recommend this procedure for the analysis of data from fecal studies that lack the means to freeze or preserve samples immediately, such as citizen science efforts or remote fieldwork. Additionally, these data suggest that further control studies should be performed to allow the evaluation of candidate blooms and their impacts in nonfecal environments.