Multi-omics Comparative Analysis Reveals Multiple Layers of Host Signaling Pathway Regulation by the Gut Microbiota

Multiple host pathways were affected by its adaptation to the microbiota. We have found significant transcriptome-proteome discordance caused by the microbiota. This discovery leads to the definite conclusion that transcript-level analysis is not sufficient to predict protein levels and their influence on the function of many specific cellular pathways, so only analysis of combinations of the quantitative data determined at different levels will lead to a complete understanding of the complex relationships between the host and the microbiota. Therefore, our results demonstrate the importance of using an integrative approach to study host-microbiota interaction at the molecular level.

the transcriptomic and proteomic levels (Fig. 1A). The abundances of many gene products were statistically significantly affected by germ status (3,063 and 219 genes in the transcriptome [T] and proteome [P] data subsets, respectively). We treated the germfree status as the perturbed state and therefore compared GF mice to conventional mice throughout the analysis, defining "upregulated" (up) as corresponding to GF/C ratio values of Ͼ1 and "downregulated" (down) as corresponding to GF/C ratio values of Ͻ1.
The ileum transcriptome data were analyzed to discover transcripts that were affected by germ status (see Tables S1 and S2 in the supplemental material). A hierarchical cluster analysis (HCA) partitioned the probe-level data into two clusters: upregulated probes (approximately the top two-thirds), and downregulated probes (approximately the bottom third) (Fig. 1B). Most of the probes were also affected by strain. Specifically, germ status had a stronger effect on the transcriptome of C57BL/10A mice than on that of BALB/c mice, and these effects were largely sex independent (Fig. 1B). The ileum transcriptome data were also analyzed using InfernoRDN to discover transcripts affected by sex and/or strain (see Fig. S1B in Text S1 in the supplemental material) ( Table S3). The effect of mouse strain on the transcriptome was significant and largely independent of germ status (see Fig. S1A and B in Text S1). These results are consistent with earlier discoveries of strain-mediated differences in transcriptomes and proteomes of animal tissues. The hepatic transcriptome and proteome displayed significant variance across 97 mouse strains (33), and a similar result was found in a multi-omic study of liver from two rat strains (34). Additionally, numerous studies have identified host strain-mediated differences in host-microbiota interactions in the composition of the microbiota across 113 mouse strains (30); in Paneth cell population, antimicrobial peptide composition, and microbiota composition (31); or in microbiota composition and the colonic mucosa transcriptome (32).
The ileum proteome data were analyzed to discover protein groups that were affected by germ status, sex, and/or strain (Table S4). An HCA was used to analyze the protein groups significantly affected by germ status, and the rows partitioned into two clusters of roughly equal sizes (Fig. 1C): a downregulated (top half) protein group and an upregulated (bottom half) protein group. In addition to being affected by germ status, some of these protein groups were also affected by mouse strain. As with the transcriptome data (Fig. 1B), germ status had a stronger effect on the C57BL/10A proteome than on the BALB/c proteome. Interestingly, the effect was milder in the proteome than in the transcriptome.
HCAs were also performed using the set of protein groups that were significantly affected by the other experimental parameters (see Fig. S1C and D in Text S1). As with the transcriptome data (see Fig. S1A and B in Text S1), mouse strain had a strong effect on the ileum proteome. We discovered that 63 proteins were both significantly affected by germ status and significantly unaffected by both sex and strain. An HCA of these proteins was performed (see Fig. S1D in Text S1), and the rows partitioned into two clusters of roughly equal sizes: an upregulated protein group and a downregulated protein group. We observed that the highly expressed genes were more frequently identified and quantified at the protein level and were more frequently classified as significantly affected by germ status at both the transcript and protein levels (see Fig. S2 in Text S1). Our proteome data subset was compared to a quantitative shotgun proteomics data set from a previous report (10 male GF C57BL/6 mice; at 57 to 62 days of age, 5 were conventionalized and the other 5 were mock conventionalized; proximal colons were harvested 14 days later) (23). The genes that were significantly affected by germ status within both sets of data were used for a correlation analysis, and the correlations were statistically significant (see Fig. S3A in Text S1).
The significantly affected genes in the transcriptome and proteome data subsets were tallied and analyzed using HCAs (Fig. 2). Most of the genes that were quantified within both data subsets and significantly affected within at least one of them were unaffected within the other. However, among the genes that were significantly affected within both data subsets, the effects were much more likely to be concordant than discordant. Data from a correlation analysis showed a highly significant correlation (see Fig. S3B in Text S1). This observation is consistent with numerous reports of moderate correlation of transcriptome and proteome abundance ratios from analyses of perturbed versus control samples (35). Germ status had a relatively strong effect on the C57BL/10A mice compared to the BALB/c mice, similarly to the results depicted in Fig. 1B and C.
Adaptation of the ileum to the microbiota primarily affects metabolism and the immune system. The overlapping genes from the transcriptome and proteome data that were significantly affected by germ status were used to produce interset concordant and discordant gene sets (described in detail in Materials and Methods; a total of 30 gene sets were produced). Ingenuity pathway analysis (IPA) was used to analyze these sets of genes to discover enriched biological pathways and other gene annotations (a total of 428 biological pathways were significantly enriched [gene annotation enrichment false-discovery rate {q e } ϭ Ͻ0.1] in at least 1 of the 30 analyses; Tables 1, S5, S6, and S7). The pathway annotations overlapped more than the functional annotations (see Fig. S4 in Text S1). The microbiota caused upregu-lation of many immune system pathways in the conventional mice, and most of this was transcriptome-proteome concordant. It is likely that the microbiota causes an upregulation of intestinal immune system pathway gene expression in nonimmune cells. Also, we have previously reported that the microbiota causes an increase in immune cell populations in the terminal ileum (14), and this was likely a major cause of the observed upregulation. The microbiota also caused downregulation of many metabolic pathways in the conventional mice. Some of this dysregulation effect was transcriptome-proteome concordant, and some was discordant. The microbiota is known to perform numerous metabolic functions, and its absence likely causes the host to upregulate some of the corresponding pathways.

FIG 2
Concordance across the transcriptome and proteome data subsets. Genes were either upregulated due to germ status (red, GF/C Ͼ 1, q min Ͻ 0.1), downregulated (blue, GF/C Ͻ 1, q min Ͻ 0.1), unaffected (gray, q min Ն0.1), or unquantified (n.q.; black). Tallies were used to quantify the overall levels of concordancy (light green) and discordancy (light yellow and red). The overlapping, significantly affected genes were analyzed using HCAs (only rows were clustered; gray ϭ missing value). Each row depicts a gene, and each column depicts a pair of mice (1 GF, one conventional, sex and strain matched, from one microarray; same pairing for the proteome and transcriptome columns). Each standard deviation ( st ) was calculated across the columns independently for each data subset and mouse strain using the log 2 -transformed GF/C ratios.
Seventy-nine pathways were downregulated in the transcriptome and/or proteome data subsets, and many were immune system pathways. Twenty-two pathways were downregulated in both subsets, and almost all were immune system pathways such as the antigen presentation pathway (Fig. 3A). Forty-three pathways were upregulated in the transcriptome and/or proteome, and almost all were metabolism pathways. Only three pathways were upregulated in both: the glutathione-mediated detoxification pathway, the serotonin degradation pathway, and the xenobiotic metabolism signaling pathway. Thirty-two pathways were downregulated in the concordant[T,P] gene set, and most were immune system pathways. Thirty-nine pathways were upregulated in the concordant[T,P] gene set, and almost all were metabolism pathways.
Network analyses of the significantly enriched biological pathways were performed. Each used 1 of the 30 sets of genes significantly affected by germ status. For each gene set, five networks were examined (the minimum number of overlapping significantly affected genes per edge examined ranged from one to five). Network analysis of the proteome data subset (see Fig. S5A in Text S1) and of the concordant[T,P] gene set ( Fig. 3B) resulted in two clusters. One was principally composed of downregulated immune system pathways, and the other was principally composed of upregulated metabolic pathways. Network analysis of the invariant[P, up] gene set produced a cluster of upregulated metabolic pathways (see Fig. S5B in Text S1) similar to the proteome and concordant[T,P] network results. This network might constitute the core set of pathways that are normally resistant to perturbation but are nevertheless altered by the microbiota.
Ileal pathway adaptation to the microbiota is partially transcriptome-proteome discordant. Transcriptome-proteome concordance is consistent with regulation occurring mainly at the transcriptional level. In contrast, gene products being significantly affected at the transcriptome level and not at the proteome level (discordant[T\P]), or the reverse (discordant[P\T]), would mean that regulation occurs principally via posttranscriptional or cotranslational regulation, targeted protein degradation, or protein secretion. There were 5 instances of opposite regulation at the transcriptomic and proteomic levels, 113 instances of significantly affected proteins and unaffected transcripts, and 578 instances of the reverse (Fig. 2). These data were analyzed to discover differentially regulated biological pathways.
Four pathways were significantly enriched in the discordant[T\P] gene set, and all four were upregulated. Two of these pathways, the eukaryotic initiation factor 2 (eIF2) signaling and mechanistic target of sirolimus (mTOR) signaling (see Fig. S6 in Text S1) pathways, highly overlapped. These pathways included much of the protein translation machinery. The significantly affected genes annotated with the Ingenuity biofunctions "Processing of RNA" and "Translation" were analyzed, and a similar pattern was discovered (intratranscriptome concordancy and transcriptome-proteome discordancy; see Fig. S7 in Text S1). All of the significantly affected eIF and ribosomal protein (RP) genes were found to be upregulated in the transcriptome data subset, and almost all were transcriptome-proteome discordant (Fig. 4A). Intraset concordancy of all of the eIF, 40S RP, and 60S RP transcript-and protein-level data was analyzed, and it was discovered that the eIF, 40S RP, and 60S RP transcripts were upregulated, the eIF proteins were unaffected, and the 40S and 60S RPs were downregulated (Fig. 4C). Generally, such differences may result from changes in posttranscriptional regulation, mRNA export and localization, cotranslational regulation, protein secretion, or targeted protein degradation (35).
All RP genes and many eIF genes are known to be posttranscriptionally regulated via their 5=-terminal oligopyrimidine (5=TOP) tract (36,37). In the inactive state, LARP1, TIA1, or TIAL1 is bound to the 5=TOP tract, which represses translation (38)(39)(40). LARP1 binding increases transcript stability, and TIA1 and TIAL1 binding targets the transcript to stress granules for mRNA triage. Mechanistic target of sirolimus (mTOR) complex 1 activation antagonizes repression of translation by LARP1, TIA1, and TIAL1. Posttranscriptional regulation by LARP1, TIA1, and/or TIAL1 may have caused the transcriptomeproteome discordancy that we observed. In addition, it is also possible that the The antigen presentation pathway was downregulated in the GF mice. Each pathway node was dually colored using the transcriptome (left) and proteome (right) data subsets. The immunoproteasome complex is from the protein ubiquitination pathway. The statistical significance data were q e (transcriptome, downregulated) ϭ 1.0EϪ14, q e (proteome, downregulated) ϭ 6.6EϪ10, q e (meta-analysis, downregulated) ϭ 1.9EϪ3, P c (transcriptome) ϭ 3.4EϪ3, and P c (meta-analysis) ϭ 3.9EϪ2. The proteome node-level intraset concordancy data were not statistically significant, but they were close (P c ϭ 6.5EϪ2; significance level ϭ 0.05). CALR, (Continued on next page) Multi-Omics Reveals Microbiota-Regulated Host Pathways microbiota caused a decrease in transcript stability via LARP1, TIA1, and/or TIAL1, resulting in the observed concordant downregulation of these transcripts in the conventional mice. A second 5= untranslated region tract called the pyrimidine-rich translational element is also involved in mTOR-mediated translation upregulation of many RP transcripts (41), and it is possible that regulation at this tract was involved.
The other two discordant[T\P] pathways, the OxP and mitochondrial dysfunction pathways, also highly overlapped. The OxP pathway nodes were intratranscriptome concordant (all 31 were upregulated in the GF ilea), and they were all transcriptomeproteome discordant except for Atp5a1 (Fig. 4B). Mouse gene expression of OxP genes was found to be coregulated using a coexpression network analysis (data not shown), as previously reported (42). Intraset concordancy of all of the OxP transcript-and protein-level data was analyzed, and it was discovered that the OxP transcripts were upregulated and that the OxP proteins were unaffected (Fig. 4C).
Gene expression of OxP genes across 61 mouse tissues was previously reported to be coregulated (42). As with the eIF and RP genes, it is possible that the microbiota affects the transcription rate and/or transcript stability of OxP genes, resulting in the observed concordant downregulation of these transcripts in the conventional mice. Interestingly, we previously reported that transcripts encoding ribosomal and OxP proteins were significantly downregulated in the mouse ilea in response to treatment with either antibiotics or antibiotic-resistant microbes and that these effects were mediated by virulence factors and quorum sensing by antibiotic-resistant bacteria (14).
Thirty-seven pathways were significantly enriched in the discordant[P\T] gene set. Twenty-one were upregulated, 16 were downregulated, and none were intraset discordant. A network analysis of the 21 upregulated pathways resulted in a cluster of upregulated metabolic pathways (see Fig. S8A in Text S1). Thus, the proteomic analysis discovered many upregulated metabolism pathway elements that were missed by the transcriptomic analysis. Interestingly, of the 21 upregulated discordant[P\T] pathways, 12 were also upregulated invariant[P] pathways.
Fifteen of the 16 downregulated discordant[P\T] pathways contained chaperone proteins that were significantly affected by germ status. Therefore, the interset concordancy and intraset concordancy/discordancy of the chaperones and cochaperones in the transcriptome and proteome data subsets were analyzed. There were 310 (co)chaperone genes corresponding to 428 microarray probes in the transcriptome; 206 probes were upregulated, 222 probes were downregulated, and the results were found not to be significantly intraset concordant (P c ϭ 0.47). Forty-three (co)chaperone genes (44 microarray probes) were significantly affected by germ status in the transcriptome data subset. Of the 44 significantly affected probes, 31 were upregulated and 13 were downregulated, which represented significant intraset concordance (P c ϭ 9.6EϪ3). All 43 of the (co)chaperone genes affected at the transcript level were unaffected at the protein level, with one exception: Ifit1, an interferon-induced cochaperone, was downregulated at both levels.
In the proteome data subset, there were 107 (co)chaperone genes corresponding to 100 nonredundant quantified protein groups. Of the 100 nonredundant quantified protein groups, 37 were upregulated and 63 were downregulated, representing significant concordance (1.2EϪ2). Of the 100 nonredundant quantified protein groups, 15 were significantly affected by germ status, and all 15 were downregulated, representing significant intraset concordance (P c ϭ 6.1EϪ5). Many of these 15 proteins have been reported to directly associate with each other, and all 15 genes were unaffected in the transcriptome data subset except for the Ifit1 gene (Fig. 4D). Thus, the (co)chaperone transcripts and proteins were affected oppositely (transcripts upregulated and proteins downregulated). Consistently, the full (co)chaperone transcriptome and proteome data subsets were found to be significantly discordant (Mann-Whitney P value [P MW ] ϭ 5.0EϪ3). To identify concordance within the (co)chaperone data more precisely, the chaperone and cochaperone data were analyzed separately. The chaperones were not concordantly affected at the transcript level, the cochaperones were concordantly The OxP pathway also displayed significant intraset concordancy [q e (transcriptome, up) ϭ 3.4EϪ6, P c [transcriptome] ϭ 9.3EϪ10] and transcriptome-proteome discordancy (cyan outlines). (C) All of the following categories of transcriptome and proteome data related to the four annotations were analyzed: eIFs, the 40S ribosomal complex (RPS), the 60S ribosomal complex (RPL), and the OxP pathway. Each triangle depicts a microarray probe (transcriptome [Trans.] data subset) or a protein group (quantified nonredundantly; proteome [Prot.] data subset). Red and blue triangles depict significantly increased and decreased gene products, respectively (gray triangles ϭ not significantly affected). The lines indicate each median and interquartile range. Each binomial test either used all of the data or was restricted to just the significantly affected data, as indicated. n/a, not applicable. (D) A protein-protein interaction network was prepared using all of the chaperones (rectangles) and cochaperones (ovals) that were significantly affected by germ status within the proteome data subset, and the intraset concordance was significant (P c ϭ 6.1EϪ5). Also included are the E3 ubiquitin ligases (yellow hexagons) that have been reported to associate with chaperones and typically target misfolded/unfolded proteins. Edge weight data depict the protein-protein interaction confidence (increasing weight values depict low, medium, high, and very high confidence). (E) Summary of the effects on the GF mouse intestine. The decrease in immune cell populations was reported previously by Morgun et al. (14). upregulated at the transcript level, and the (co)chaperones were concordantly downregulated at the protein level (see Fig. S8B in Text S1). This finding is consistent with numerous reports that the microbiota causes upregulation of host intestinal chaperone protein abundance, principally in the epithelium, and that the level of upregulation is typically highest in the cells that are closest to the microbiota (43,44). The innate immune response is known to cause upregulation of heat shock protein gene expression, at least partially as a consequence of cross talk between the innate immune response and the unfolded protein response (43,45).
All but 1 of the 15 (co)chaperones downregulated at the proteome level were not affected at the transcriptome level (Fig. 4D). This may have resulted from the microbiota affecting (co)chaperone posttranscriptional regulation, protein secretion, and/or protein degradation. Cellular export of chaperones has been reported, though the mechanism is unclear (46,47), and it is possible that the microbiota somehow downregulates (co)chaperone export from cells of the mouse ileum, resulting in the observed discordancy. Some chaperones are known to associate with the E3 ubiquitin ligases that typically target unfolded/misfolded client proteins, resulting in their degradation (48).
We noted that some (co)chaperones were upregulated in the GF ilea in the transcriptome data subset and were unaffected in the proteome data subset. It is possible that transcription, translation, and protein degradation of these (co)chaperones were upregulated in the GF mice, resulting in increased abundance in the transcriptome and no significant net change in the proteome. We also noted that Stub1 itself was upregulated in the transcriptome data subset but was unaffected in the proteome data subset. STUB1 is known to autopolyubiquitinate itself (49), which may result in its degradation, and this may explain the discordancy that we observed. Intriguingly, PARK2 has a key role in derepressing translation of OxP transcripts, and it is also a chaperone-associated E3 ubiquitin ligase that typically targets unfolded/ misfolded client proteins, resulting in their degradation (48). It is possible that PARK2 has a dual role in the host response to the microbiota, linking derepression of translation with chaperone-associated ubiquitination.
A summary of the ileum data set results is provided as Fig. 4E. Adaptation of the intestinal transcriptome to the microbiota varied significantly across globally diverse murine and microbiota populations. A meta-analysis was performed to discover host genes that were affected by the microbiota (Fig. 1A). The meta-analysis used the microarray data originating from a globally diverse set of samples that varied in microbiota population, host strain, host sex, tissue location within the intestine, and sample preparation protocol (Table S8). The abundances of 2,132 genes were statistically significantly affected by germ status (Table S9). This is the first report of such an analysis. Ideally, a corresponding meta-analysis using proteomics data would also have been performed, but our proteomics comparison of conventional and germfree mouse intestinal tissue is the first of its kind.
An HCA of the meta-analysis data partitioned the rows into two clusters of roughly equal sizes: one of upregulated genes and the other of downregulated genes (Fig. 5A). The HCA also partitioned the columns (depicting comparison group pairs) into two clusters. Germ status had a relatively strong effect on the leftmost cluster (14 columns; dendrogram highlighted green in Fig. 5A). This cluster contained mice from two studies: male C57BL/6J mice (jejunum, ileum, and colon; conventionalized for 4, 8, 16, and 30 days; GEO dataset accession no. GSE32513), and female C3H/HeN mice (ileum, conventional and conventionalized using a mouse microbiota for 8 and 60 days; GEO dataset accession no. GSE18056). From left to right, the columns of the green cluster Fig. 5A represent 11 GSE32513 columns (3 colon columns followed by 4 ileum columns and 4 jejunum columns) followed by 3 GSE18056 columns. The colon 30-day GSE32513 column data were aberrantly absent from the green cluster.
In comparison to the green cluster, germ status had a relatively weak effect on the other cluster (17 columns; dendrogram highlighted in yellow in Fig. 5A). It is likely that six columns of the yellow cluster were weakly affected because they originated from The transcriptome data subset and meta-analysis data set were partially concordant. (A) An HCA of the meta-analysis data set resulted in clustering of relatively strongly and weakly affected mice (indicated by the green and yellow dendrogram highlighting, respectively). Each row depicts a signifi-comparisons of GF versus conventionalized mice at early time points (days 1 and 2; all from the GSE32513 study). There was no obvious reason for the relatively weak effect shown in the other 11 columns. However, two columns of the yellow cluster (columns 4 and 5, counting from the left) represented data from mice that had been conventionalized using a human microbiota (GSE18056 study; 8 and 60 days, respectively), and it is possible that human-mouse microbiota differences somehow caused the relatively weak effect. We noticed that one column of the yellow cluster (column 7, counting from the left) was the colon 30-day GSE32513 column mentioned above and that the bottom half of this column roughly matched the corresponding data in the green cluster. In summary, the meta-analysis revealed that the mice were affected by germ status at two levels: a relatively low level and a high level. It is possible that this dissimilarity was due to the different mouse strains and/or different microbiota populations examined. The pattern of relatively strong and weak effects in the meta-analysis data set was consistent with the strain-dependent patterns seen in both the transcriptome and proteome data subsets (Fig. 1B and C).
An HCA of the significantly affected genes in the transcriptome data subset and the meta-analysis data set was performed (Fig. 5B), and the row and column partitioning was similar to that described above ( Fig. 1B and C, 2, and 5A). With a single exception (the aberrant colon 30-day GSE32513 column), the strong/weak partitioning of the meta-analysis columns was the same in Fig. 5A and B. Similarly, the strong/weak partitioning of the transcriptome columns was the same in Fig. 1B and 5B except for a single BALB/c column. A correlation analysis was performed, and the correlation was highly significant (see Fig. S9A in Text S1). Reassuringly, the concordant changes between the transcriptome data subset and meta-analysis data set validated each other. The overall concordance across the transcriptome, proteome, and meta-analysis sets of data were examined (see Fig. S9B in Text S1). A significantly affected gene in one set was most likely to be unaffected or unquantified within the other two sets. However, if a gene was significantly affected within multiple sets, the effects were more likely to be concordant than discordant.
Adaptation of the intestine to the microbiota affects hundreds of biological pathways. Three hundred eighty-four Ingenuity pathways were significantly enriched in the meta-analysis data set (Table 1). Two hundred sixty-eight (70%) of the pathways were not enriched in the transcriptome or proteome data subsets (28 were downregulated, 60 were upregulated, and 180 were intraset discordant). Thus, the meta-analysis greatly broadened the discovery of pathways affected by germ status. As with the ileum data set, some of these pathways were downregulated immune system pathways and upregulated metabolic pathways, but the pathways varied greatly overall.
Twenty-eight pathways were downregulated in both the transcriptome and metaanalysis sets of data, and almost all were immune system pathways such as the interferon signaling pathway (Fig. 5C). Only three pathways, the glutathione-mediated detoxification pathway, the serotonin degradation pathway, and the xenobiotic metabolism signaling pathway, were upregulated in both the transcriptome analysis and cantly affected gene (n ϭ 2,132), and each column depicts an intrastudy pair of groups of mice (a "comparison group pair" consisting of a conventional group and a GF group) (both rows and columns were clustered; gray ϭ missing value). (B) An HCA of the intersection of the transcriptome data subset and the meta-analysis (MA) data set resulted in clustering of strongly and weakly affected mice (indicated by the green and yellow dendrogram highlighting, respectively). Each row depicts a significantly affected gene (n ϭ 530), each transcriptome column depicts a pair of mice (1 GF and one conventional from one microarray; sex and strain matched; red and lavender bars), and each meta-analysis column depicts an intrastudy "comparison group pair" (orange bars) (q min [germ status] Ͻ 0.1 required for both sets of data [concordance of GF/C ratios was not required]; both rows and columns were clustered; gray ϭ missing value). (C) The interferon signaling pathway was downregulated in the GF mice. Each pathway node was dually colored using GF/C ratios from the sets of meta-analysis (left) and transcriptome (right) data. The statistical significance data were as follows: q e (transcriptome, downregulated) ϭ 1.6EϪ6, q e (proteome, downregulated) ϭ 4.6EϪ2, q e (meta-analysis, downregulated) ϭ 1.7EϪ3, P c [meta-analysis] ϭ 3.9EϪ2. The transcriptome node-level intraset concordancy was not statistically significant, but it was close (P c ϭ 5.7EϪ2; significance level ϭ 0.05). meta-analysis, and these three were also upregulated in the proteome. Ninety pathways were downregulated in the concordant[T,M] gene set, and most were immune system pathways. Twenty-one pathways were upregulated in the concordant[T,M] gene set, and almost all were metabolism pathways.
Fifteen pathways were downregulated in the sets of transcriptome, proteome, and meta-analysis data, and almost all were immune system pathways such as the antigen presentation pathway (Fig. 3A) and the interferon signaling pathway (Fig. 5C). As mentioned above, three pathways were upregulated in all three sets of data. Eight pathways were downregulated in the concordant[T,P,M] gene set, and almost all were immune system pathways. Twenty-eight pathways were upregulated in the concordant[T,P,M] gene set, and almost all were metabolism pathways. Notably, the aryl hydrocarbon receptor signaling pathway was significantly enriched and intraset discordant in all three sets of data.
Network analysis of the concordant[T,M] gene set resulted in three clusters (Fig. 6A). The top-right cluster in the figure panel primarily consists of upregulated metabolic pathways, the bottom-center cluster primarily consists of downregulated immune system pathways, and the third cluster (upper left) partially consists of downregulated immune system pathways. Interestingly, the upper half of the third cluster was principally composed of intraset discordant cancer pathways. In a related result, the colorectal cancer metastasis pathway was enriched and intraset discordant in the metaanalysis data set (q e [up ϩ down] ϭ 1.6EϪ12). This pathway was not enriched in either the transcriptome data subset or the proteome data subset. Thus, the microbiota affected numerous cancer pathways at the transcript level. Network analysis of the concordant[T,P,M] gene set resulted in two clusters (Fig. 6B). One was principally composed of downregulated immune system pathways, and the other was principally composed of upregulated metabolic pathways. This network might constitute the core set of pathways that are most broadly and strongly affected by the microbiota.
Conclusions. The microbiota is an important factor for human health and biomedical research, and GF animals are a valuable tool for discovering microbiota effects on the host. A hypothetical model of some of the results is summarized in Fig. S10 in Text S1, and compendia of illustrations of the affected pathways have been prepared and are available upon request. So far, most system-level studies comparing conventional and GF animals have focused on transcriptomic analyses. Although it has been found that protein abundance under steady-state conditions is primarily determined by transcript abundance, the spatial and temporal variations of mRNAs, the local availability of resources for protein biosynthesis, posttranscriptional regulation, and targeted protein degradation strongly influence the relationship between protein levels and their coding transcripts. Transcript fluctuations can be buffered at the level of protein concentration; consequently, analysis of transcript levels alone is generally not sufficient to predict protein levels in many scenarios (50), including during hostmicrobiota interactions. To address this, we performed the first proteomic comparison of the gastrointestinal tract of conventionally raised and GF mice alongside a corresponding transcriptomic analysis. Multiple gene products and hundreds of biological pathways were found to be affected by germ status in the mouse terminal ileum, and protein-level variance often did not reflect a similar change in the corresponding transcript level.
While transcriptomics analysis of small samples (e.g., from laser capture microdissection) is now routine, proteomics technology needs to continue to be researched and developed to enable deep analyses of small-mass samples. When such analyses become possible, this work will provide the foundation for the studies of cell-specific changes in the proteome caused by the microbiota. Now, on the basis of our analysis and our previous work, we can speculate that some of the changes that we observed at the proteome and transcriptome levels are caused by changes in the abundance of specific cell populations in the ileal tissue. Our previous study (14) had shown enrichment of T and B lymphocytes caused by the microbiota. However, this type of analysis does not discriminate between subpopulations, which would be more interesting to analyze. Therefore, future studies using single-cell transcriptome sequencing (RNA-seq) would enable addressing this issue.
Our report demonstrates that transcript abundance cannot always reliably predict protein abundance differences caused by the perturbations of the microbiota; provides a resource to study the effect of the microbiota on the host at the functional, protein, and pathway levels; and reinforces the idea of the importance of measuring protein abundance alterations to reveal molecular mechanisms behind the phenotypic changes caused by changes in the microbiota.

MATERIALS AND METHODS
Mouse ilea. This study was approved by the NIH Animal Care and Use Committee. Rederived GF and conventional mice (female and male; BALB/c and C57BL/10A strains; 6 to 8 months old) were prepared by Taconic Biosciences, Inc. (Hudson, NY). The conventional mice were housed under normal conditions and maintained using standard procedures. Six GF mice were prepared for each experimental class (two sexes, two strains, 24 GF mice in total) with the goal of having 5 mice per experimental class (two germ statuses, two sexes, two strains, 40 mice in total). Samples (e.g., feces) from the GF mice were analyzed to ensure the complete absence of any microbial contamination, and mice were discarded if any sample tested positive. This resulted in only 4 GF female C57BL/10A ilea being available. The ilea were harvested, placed into and gently flushed with ice-cold phosphate-buffered saline (PBS) to wash away microbes, segmented, and immediately frozen on dry ice. One set of the segments was used for the transcriptomics experiments, and another set was used for the proteomics experiments. This resulted in the production of data referred to here as the "ileum data set," consisting of the transcriptome and proteome "data subsets." Transcriptomics. Total RNA was isolated from the ilea using Total RNeasy minikits (Qiagen, Hilden, Germany). Fluorescently (green and red) labeled cDNA was prepared using 20 g of total RNA and 2 l of 1 mM Cy3-dUTP and Cy5-dUTP, respectively (GE Healthcare, Chicago, IL), 1 g of oligo(dT) 12-18 primer, 0.5 mM dATP, 0.5 mM dCTP, 0.5 mM dGTP, 0.15 mM dTTP, 10 mM dithiothreitol, 1.5 l of 200 U/l SuperScript II reverse transcriptase, and 0.5 l of 20 U/l RNase inhibitor (all from Thermo Fisher Scientific Inc., Waltham, MA, unless noted otherwise). The resulting cDNA was purified using Vivaspin 500 columns (Sartorius AG, Göttingen, Germany) (10,000 molecular weight cutoff [MWCO]). Each sample pair (red and green; 38 samples in total) was cohybridized to a gene microarray (Agilent Technologies, Santa Clara, CA) (45,220 spots, 37,558 reporter spots, 35,958 reporter probes, 60-mer probes, "4 by 44 k" slide format, design number 017138) using a MicroArray User Interface (MAUI) hybridization system (BioMicro Systems Inc., Salt Lake City, UT). A reference-free experimental design was used. Each microarray was used to analyze 1 GF-mouse sample and one conventional-mouse sample (sex and strain matched). To mitigate label effects, half of the samples of each experimental class were randomly selected to be labeled green and the other half red. The dual-color microarrays were scanned using a Microarray Scanner System, and the resulting images were processed using Feature Extraction (both Agilent Technologies). These data are available at the Gene Expression Omnibus data repository at the National Center for Biotechnology Information (GEO Platform GPL9354; GEO entry identifier GSE95650; http:// www.ncbi.nlm.nih.gov/geo/). Gene symbols were retrieved from bioDBnet (12 March 2016; https://biodbnet-abcc.ncifcrf.gov/) (51) using the gene and transcript identifiers within the GEO Platform data set. This resulted in 22,861 mouse gene symbols being associated with 29,819 reporter probes. Each reporter spot foreground and background intensity value represented the corresponding median pixel intensity value. Each spot background intensity value was set to be a maximum of four times the median background intensity value (calculated across each microarray and color channel; this affected 251 of the 1,427,204 background intensity values). For each ileum sample and spot, if the net intensity (foreground Ϫ background) value was Ͻ10, it was replaced with a random number between 10 and 20 (this range was roughly consistent with the range of the background values, as the median background intensity value was 35.5; this affected 31,139 of the 1,427,204 net intensity values). For each microarray/color channel/probe, the arithmetic mean of the net intensity values was calculated across replicate spots. The resulting values were log 2 transformed.
A statistical analysis was performed using BRB-ArrayTools (http://brb.nci.nih.gov/BRB-ArrayTools/) (52) to identify probe intensity values that were significantly affected by germ status. To this end, intra-array LOWESS normalizations (span ϭ 0.4) were performed. The resulting data were used for analyses of variance (ANOVAs) that used a statistical model designed specifically for reference-free dually labeled microarray data (53). The BRB-ArrayTools "Random Variance Model" option was used. The ANOVA model estimated array effects as well as the effects due to the eight experimental classes (two germ statuses, two sexes, two strains). The resulting data were used for post hoc pairwise comparison testing across the experimental classes (35,958 probes ϫ 28 pairs of experimental classes ϭ 1,006,824 tests), which resulted in a q value (i.e., a false-discovery-rate value) for each test. For each probe, the minimum q value (q min ) was calculated across the four relevant pairs of experimental classes (GF mice versus conventional mice, sex and strain matched). The data were filtered by requiring q min values to be Ͻ0.1. This resulted in 3,817 probes (10.6%) passing this filter, and this corresponded to 3,063 genes (13.4%). Pearson and Spearman correlation P values were calculated using 2-tailed t tests. HCAs were performed using Genesis (complete linkage clustering using Euclidean distance) (54). For each probe, the arithmetic mean of the log 2 -transformed and LOWESS-normalized GF/C (germfree/conventional) ratios (described above) was calculated across the microarrays, and the anti-log 2 values determined were used as the GF/C values for the pathway analyses.
An additional statistical analysis was performed using InfernoRDN (https://omics.pnl.gov/software/ infernordn) (55) to identify probe intensity values that were significantly affected by sex and/or strain. To this end, an interarray LOESS (locally weighted scatterplot smoothing) normalization (reference ϭ median value, span ϭ 0.4) was performed. The resulting data were used to perform a 3-way ANOVA (germ status, sex, and strain; an array effect was not included) for each probe. Each 3-way ANOVA produced a q value for each of the three effects (germ status, sex, and strain) and for each of the four interactions (germ status plus sex, germ status plus strain, sex plus strain, and germ status plus sex plus strain). For each probe, a minimum q value (q min ) was calculated for each experimental parameter (germ status, sex, and strain). For example, q min (sex) equaled the minimum of the q values that involved the sex parameter [q(sex), q(germ status plus sex), q(sex plus strain), and q(germ status plus sex plus strain)]. The data were filtered by requiring q min values to be Ͻ0.1. This resulted in 1,180 (3.28%), 574 (1.60%), and 22,758 (63.3%) probes passing the germ status, sex, and strain filters, respectively, and this corresponded to 917 (4.01%), 462 (2.02%), and 15,784 (69.0%) genes, respectively. HCAs were performed using Genesis as described above. Note that these data were not used for pathway analyses.
Dithiothreitol was added to 500 g (protein mass) of each ileum homogenate, and the samples were incubated at 60°C for 45 min to reduce protein cysteine residues (final dithiothreitol concentration ϭ 10 mM). Iodoacetamide was added to each sample (final concentration ϭ 50 mM), and the samples were incubated at room temperature for 20 min in darkness to alkylate cysteine residues. Each sample was diluted using 100 mM HEPES-NaOH (pH 8) such that the final urea concentration was 1 M. Protein digestion was performed by adding sequencing-grade modified trypsin (Promega Corp., Madison, WI) (1:100 [wt:wt] protein trypsin/sample) and incubating the samples at 37°C for 19 h. Each sample was acidified by adding formic acid (final concentration ϭ 1% [vol/vol]), and solid-phase extraction was performed using Sep-Pak C 18 columns (Waters Corp., Milford, MA).
Percolator (a component of Proteome Discoverer) was used to analyze all of the resulting peptidespectrum matches (PSMs), and the PSMs were required to have Percolator q values of Ͻ0.01 (61,62). This resulted in 577,567 PSMs with a "refined" false-discovery rate (63) of 0.0046. This corresponded to 29,954 peptide groups (ignoring peptide modifications, this corresponded to 22,573 peptides). The data were further filtered by requiring each protein group identification to correspond to either Ն2 peptide groups and Ն4 PSMs or Ն10 PSMs. This resulted in 3,088 protein group identifications (not including protein standards and common contaminants; false-discovery rate ϭ 0.018), which corresponded to 2,968 genes (gene symbols were retrieved from bioDBnet as described above). Of the 3,088 protein group identifications, 2,896 (93.8%) corresponded to Ն2 peptide groups, and 2,823 (91.4%) corresponded to Ն2 peptides (ignoring peptide modifications).
Proteome Discoverer was used to perform iTRAQ 8plex quantification relative to the quantification reference "pool" sample described above. Each protein group abundance value was equal to the median of the corresponding PSM sample/pool ratios. Proteome Discoverer was used to perform an abundance bias correction (a central-tendency normalization) using the median protein group abundance values. The resulting 39 sample normalization values (1 per ileum) ranged from 0.52 to 1.87. The geometric mean of the 39 Љlog-absolute valuesЉ was 1.216 (if a normalization value was Ͼ1 or Ͻ1, then the corresponding log-absolute value was defined as the normalization value or the reciprocal normalization value, respectively), which indicated that the mean bias was 21.6%. Six subsets of the set of 39 normalization values were each defined by association with one of the six experimental parameters (conventional, GF, female, male, C57BL/10A, and BALB/c), and the geometric mean of each subset of normalization values was nearly unity (0.98, 0.96, 1.02, 0.92, 0.93, and 1.01, respectively), which indicated that the biases were largely independent of all six experimental parameters. The raw mass spectra and the Proteome Discoverer data were deposited in the ProteomeXchange Consortium database (http://proteomecentral .proteomexchange.org) via the PRIDE partner repository (64) with data set identifier PXD003177.
The protein group abundance values were log 2 transformed, and InfernoRDN (https://omics.pnl.gov/ software/infernordn) (55) was used to perform 3-way ANOVAs. As with the transcriptomics InfernoRDN 3-way ANOVAs (described above), each ANOVA produced a q value for each effect (germ status, sex, and strain) and interaction, and the data were filtered by requiring the minimum q value (q min , calculated for each experimental parameter) to be Ͻ0.1. This resulted in 242 (7.84%), 13 (0.42%), and 539 (17.5%) protein groups passing the germ status, sex, and strain filters, respectively, and this corresponded to 219 (7.38%), 12 (0.40%), and 495 (16.7%) genes, respectively. HCAs were performed using Genesis as described above. For each protein group, the geometric mean of the protein group abundance values (normalized; not log 2 transformed) was calculated across the GF samples, the same was calculated for the conventional samples, and the ratios of the two geometric means were used as the GF/C values for the pathway analyses.
Meta-analysis. The meta-analysis used eight previously generated, publicly available gene microarray data sets described in 10 publications (12,(65)(66)(67)(68)(69)(70)(71)(72)(73). Throughout this article, these data are explicitly referred to as the "meta-analysis data set" (the "transcriptome data subset" exclusively refers to the mouse ileum experiments described above; the meta-analysis did not use any of the transcriptome data subset). All of the meta-analysis data were from experiments that used GF and conventional (including conventionalized) mice, and all of the analyses were of small-intestine or large-intestine whole-tissue samples (243 samples in total).
The data were downloaded from the Gene Expression Omnibus data repository at the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/geo/) (identifiers GSE1392, GSE5198, GSE8006, GSE18056, GSE23573, GSE32084, GSE32513, and GSE46952). The meta-analysis was performed using OMics Compendia Commons (OMiCC; https://omicc.niaid.nih.gov/) (74), a community-based online toolset for analyzing gene expression data. The results of the meta-analysis are available online (https://omicc.niaid.nih.gov/myProject/showResult/1118?acϭ0SW9E3XLH). In OMiCC, the samples were arranged into 43 intrastudy "sample groups" (groups of biological replicates; minimum of three replicates per group), and intrastudy pairs of the sample groups (each pair contained one GF group and one conventional group) were arranged into 31 "comparison group pairs." The meta-analysis tested for gene expression that was significantly affected by germ status (using only intrastudy comparisons), and it produced two q values for each gene [OMiCC meta-analyses always produce a q(up) value and a q(down) value for each gene]. For each gene, the minimum of the two q values was calculated (q min ), and the data were filtered by requiring q min values of Ͻ0.1 (2,132 genes passed this filter). HCAs were performed using Genesis as described above. For each gene, the geometric mean of the GF/C (germfree/conventional) ratios was calculated across the "comparison group pairs," and these values were used as the GF/C values for the pathway analyses.
Pathway analyses. Ingenuity pathway analysis (Qiagen) was used to perform "Core Analyses" to discover biological pathways that were significantly affected by germ status. Each Ingenuity analysis was performed using sets of genes that were significantly affected by germ status and categorized as follows: "meta-analysis," "transcriptome," "proteome, The meta-analysis, transcriptome, and proteome gene sets (2,132, 3,063, and 219 genes, respectively) each consisted of the genes that were significantly affected by germ status (q min ϭ Ͻ0.1, as described above). The concordant[T,P,M] gene set (33 genes) consisted of the intersection of the aforementioned transcriptome, proteome, and meta-analysis gene sets, and the 3 GF/C ratios (transcriptome, proteome, and meta-analysis) for each gene were required to be concordant (i.e., all 3 GF/C ratios ϭ Ͻ1, or all three ϭ Ͼ1). The concordant[T,P] gene set (88 genes) was the same except that only the transcriptome and proteome data subsets were considered. Likewise, the concordant[T,M] gene set (441 genes) was the same except that only the transcriptome and meta-analysis gene sets were considered.
Each gene in the discordant[T\P] gene set (583 genes) was significantly affected by germ status in the transcriptome data subset (q min ϭ Ͻ0.1), was confidently identified and quantified in the proteome data subset, and was not significantly affected by germ status in the proteome (q min ϭ Ն0.1) or else the transcriptome and proteome GF/C ratios were in opposite directions (i.e., one represented upregulation and the other downregulation). The discordant[P\T] gene set (118 genes) was the same but with the transcriptome and proteome data subsets inverted. Each Ingenuity analysis used one of the 30 sets of genes described above (10 "up ϩ down" gene sets, 10 "up" gene subsets, and 10 "down" gene subsets). Each Ingenuity analysis also required a reference set of genes, and each was the corresponding unfiltered (by q min ) experimental gene set (the whole-mouse genome was used for the meta-analysis analyses). For each Ingenuity analysis, a gene annotation enrichment q value (q e ) was calculated for each gene annotation (e.g., representing a biological pathway or function) by using a right-tailed Fisher exact test and subsequently applying Benjamini-Hochberg multiple-hypothesis analysis. Note that this algorithm (the Fisher exact test plus Benjamini-Hochberg test) was itself indifferent to concordancy in the direction of the GF/C values. The definition of q e is that it represents the estimated fraction of false-positive results within the discovered set of enriched annotations (the set defined by assuming that the q e value represents the significance level).
The "up ϩ down" analyses were used to discover gene annotations that were statistically significantly enriched within the set of gene products that were significantly affected by germ status. Likewise, the "up" analyses were used to discover annotations that were significantly enriched within the set of upregulated gene products and the "down" analyses were used to discover annotations that were significantly enriched within the set of downregulated gene products. For some of the annotations, the "up" or "down" analysis resulted in a q e value that was lower (i.e., represented greater statistically significance) than the corresponding q e value from the "up ϩ down" analysis. Mathematically, this is because q e is dependent on both the number of affected genes that are assigned the annotation and the number of affected genes that are not assigned the annotation. Consequently, compared to an "up ϩ down" analysis, an "up" or "down" analysis can reduce either of these numbers, and this can result in an increase or decrease in the q e value.
The sets of biological pathways were filtered by requiring q e values of Ͻ0.1 (428 pathways passed this filter in at least 1 of the 30 analyses). Pathways were classified as "upregulated" if the annotation enrichment was maximally significant by analyzing the "up" gene subset (that is, q e [up] Ͻ q e [up ϩ down] and q e [up] Ͻ q e [down] and q e [up] Ͻ 0.1), and likewise for the pathways classified as "downregulated." This definition should not be confused with the biological activation or inactivation of the pathways, as a pathway may be inactivated due to the upregulation of specific genes/proteins, or, alternatively, a pathway can be activated due to the downregulation of specific genes/proteins. Pathways were classified as "intraset discordant" (not to be confused with interset discordance such as transcriptome-proteome discordance) if the annotation enrichment was maximally significant by analyzing the "up ϩ down" gene set (that is, q e [up ϩ down] Յ q e [up] and q e [up ϩ down] Յ q e [down] and q e [up ϩ down] Ͻ 0.1).
The pathway nodes colored red and blue were significantly (q min ϭ Ͻ0.1) increased and decreased (respectively) in abundance due to germ status. The pathway nodes colored gray were not significantly affected (q min ϭ Ն0.1), and the pathway nodes colored white represent those that were not experimentally observed. Note that OMiCC performs an internal data filtration step; consequently, some meta-analysis nodes are colored white even though the corresponding genes were present in the original microarray data sets. Dual shading of a pathway node either depicted data from multiple gene products from a single data set or depicted data from two data sets (e.g., transcriptome and proteome), as indicated. For each of the 30 analyses, Ingenuity analysis was used to produce five networks of overlapping biological pathways. Each node depicted a biological pathway, and each edge depicted Ն1 overlapping gene products that were significantly affected by germ status. The minimum number of significantly affected overlapping gene products ranged between one and five. The nodes were arranged using the Ingenuity "organic layout" algorithm. Intraset concordancy P values (P c ) were calculated using two-tailed binomial tests (significance level ϭ 0.05). Prism v7.02 (GraphPad Software, Inc., La Jolla, CA) was used to perform two-tailed Mann-Whitney tests (significance level ϭ 0.05) to produce interset discordancy P values (P MW ) and also to produce scatter dot plots.
Chaperone, cochaperone, and E3 ubiquitin ligase protein-protein interactions were retrieved from the STRING database (v10.0) (http://string-db.org/) (75). Only "database" interactions (from manually curated pathways) and "experimental" interactions were used to analyze protein-protein interactions. The STRING database assigns a confidence score to every interaction and classifies interactions as low, medium, high, and very high confidence based on score thresholds of 0.15, 0.4, 0.7, and 0.9, respectively. The set of (co)chaperones (76) and the set of chaperone-associated E3 ubiquitin ligases that target misfolded proteins (48) were previously reported. This set of interactions was manually supplemented with additional data (77)(78)(79)(80)(81)(82). Cytoscape was used for network visualization (http://www.cytoscape.org/) (83). The STRING database was also used to produce transcript coexpression networks.
Data availability. We have deposited the required mass spectrometry proteomics data in the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the PRIDE partner repository (ProteomeXchange submission title, "Germ-free mouse ileum iTRAQ shotgun LC-MSMS") under ProteomeXchange accession number PXD003177. The microarray data have been deposited in NCBI's Gene Expression Omnibus and are accessible through GEO Series accession number GSE95650.