The limitations of correlation-based inference in complex virus-microbe communities

Microbes are present in high abundances in the environment and in human-associated microbiomes, often exceeding one million per milliliter. Viruses of microbes are present in even higher abundances and are important in shaping microbial populations, communities, and ecosystems. Given the relative specificity of viral infection, it is essential to identify the functional linkages between viruses and their microbial hosts, particularly given dynamic changes in virus and host abundances. Multiple approaches have been proposed to infer infection networks from time-series of in situ communities, among which correlation-based approaches have emerged as the de facto standard. In this work, we evaluate the accuracy of correlation-based inference methods using an in silico approach. In doing so, we compare predicted networks to actual networks to assess the self-consistency of correlation-based inference. At odds with assumptions underlying its widespread use, we find that correlation is a poor predictor of interactions in the context of viral infection and lysis of microbial hosts. The failure to predict interactions holds for methods which leverage product-moment, time-lagged, and relative-abundance based correlations. In closing, we discuss alternative inference methods, particularly model-based methods, as a means to infer interactions in complex microbial communities with viruses. 2 Importance Inferring interactions from population time-series is an active and ongoing area of research. It is relevant across many biological systems – in particular in virus-microbe communities, but also in gene regulatory networks, neural networks, and ecological communities broadly. Correlation-based inference – using correlations to predict interactions – is widespread. However, it is well known that “correlation does not imply causation”. Despite this, many studies apply correlation-based inference methods to experimental time-series without first assessing the potential scope for accurate inference. Here, we find that several correlation-based inference methods fail to recover interactions within in silica virus-microbe communities, raising questions on their relevance when applied in situ.

inference in diverse communities of microbes and viruses. As we show, correlation-based inference fails to recapitulate virus-microbe interactions and performs worse in more diverse 99 communities. The failure of correlation-based inference in this context raises concerns over 100 its use in inferring microbe-parasite interactions as well as microbe-predator and microbe-101 microbe interactions more broadly.
where H i and V j refer to the population density of microbial host i and virus j respectively. 108 There are N H different microbial host populations and N V different virus populations. For 109 our purposes, a "population" is a group of microbes or viruses with identical life history 110 traits, that is microbes or viruses which occupy the same functional niche.

111
In the absence of viruses, the microbial hosts undergo logistic growth with growth rates Each virus j's population grows from infecting and lysing their hosts. The rate of virus 120 j's growth is determined by its host-specific adsorption rate φ ij and host-specific burst size  and modularity (Fig 1). We first generate the maximally nested ( Fig 1A) and maximally 134 modular ( Fig 1B) networks of size N using the BiMat Matlab package [42]. In order to 135 achieve maximal nestedness and modularity, the network fill F (fraction of interacting pairs) 136 is fixed at F = 0.55 for the nested networks and F = 0.5 for the modular networks. For the 137 modular networks, the number of modules is set to 2, 5, and 10 for the three network sizes 138 respectively.

139
To generate networks that vary in nestedness and modularity, we perform the following 140 "rewiring" procedure. Beginning with the maximally nested or maximally modular net- nestedness metric used is NODF [43], and the algorithm used to calculate modularity is

148
AdaptiveBRIM [44]. The modularity is additionally normalized according to a maximum 149 theoretical modularity as detailed in [45]. The life history traits for a given interaction network are chosen to ensure that all mi-152 crobial host and virus populations can coexist, adapted from [46].

153
First we sample target fixed point densities H * i and V * j for each microbial host and 154 virus population. In addition we sample adsorption rates φ ij and burst sizes β ij . All of 155 these parameters are independently and randomly sampled from uniform distributions with 156 biologically feasible ranges specified in Table 1. We use a fixed carrying capacity density for each microbial host i. To start, we set all intraspecific competition to one (a ii = 1) and 164 all interspecific competition to zero (a ii = 0 for i = i). Then for each microbial host i we  δ can be used to tune the amount of variability in the simulated time-series (see Fig S1).

179
After simulating virus and microbe time-series, we sample the time-series at regularly 180 spaced sample times (every 2 hours) for a fixed duration (200 hours, or 100 samples). There-181 fore, for each virus and each microbe in the community we take S samples at times t 1 , . . . , t S .

182
We use the same sampling frequency and the same S for each inference method, except for 183 time-delayed correlation (see §4.5). We assume S regularly spaced sample times t 1 , . . . , t S for each host type H i and each 186 virus type V j . The samples are log-transformed, that is h i (t k ) = log 10 H i (t k ) and v j (t k ) = 187 log 10 V j (t k ) for each sampled time-point t k . The standard Pearson correlation coefficient 188 between host i and virus j is then    [29,30,31]. 220 We applied eLSA to our simulated time-series data. We used samples of all N H host types 221 and all N V virus types with S regularly spaced sample times t 1 , . . . , t S as input. We used the  this assumption [40].
We applied SparCC to our simulated time-series data as a means to evaluate correlation- for the hosts and by for the viruses. We used the normalized N H host and N V virus samples as input for the   . Each plotted point corresponds to a unique in silico community. Dashed line marks AUC=1/2 and implies the predicted network did no better than random guessing.

Correlation-based methods eLSA and SparCC
We performed a similar in silico analysis using eLSA [29,30,31] and SparCC [40], two  and network structure. We found that correlation-based inference performed poorly given 379 variation in these network properties, but that there was greater variation in performance for 380 small networks. Because this variation is relatively small and disappears for larger networks, 381 successful predictions for small networks may themselves be spurious. Namely, for a small 382 network (e.g. N < 10), there is a greater probability of randomly guessing the interactions 383 correctly because the space of possible networks is smaller.

384
Our results raise concerns about the use of correlation-based methods on in situ datasets, 385 since a typical community under consideration will have dozens or more interacting strains 386 and therefore will not be in the low diversity microbe-only regime explored by [25]. Ad-387 ditional challenges such as external environmental drivers, measurement noise, and system 388 stochasticity must also be carefully considered before applying correlation-based methods to 389 in situ datasets. Although the degree of variability of dynamics had no effect on inference best describes the sampled community time-series (for example, see [41,39,50,49,51,52]