manta - a clustering algorithm for weighted ecological networks

Microbial network inference and analysis has become a successful approach to generate biological hypotheses from microbial sequencing data. Network clustering is a crucial step in this analysis. Here, we present a novel heuristic flow-based network clustering algorithm, which equals or outperforms existing algorithms on noise-free synthetic data. manta comes with unique strengths such as the ability to identify nodes that represent an intermediate between clusters, to exploit negative edges and to assess the robustness of cluster membership. manta does not require parameter tuning, is straightforward to install and run, and can easily be combined with existing microbial network inference tools.


Introduction
As most environmental covariates can only explain a small fraction of the variation in microbial communities, other factors such as species interactions have been suggested to play a large role [1]. A range of tools has become available to predict such associations. Many of these tools can predict positive as well as negative associations, with the exception of some approaches such as those based on mutual information [2]. Consequently, most microbial networks can be assigned edge weights that quantify the strength of the association. While the exact value of such edge weights may differ depending on tool usage, the sign is highly informative. Microbes can co-occur or exclude each other, as a range of ecological interactions takes place [3]. Previous work on ecological networks has demonstrated that the ratios of these interaction types can have implications for biodiversity and ecosystem stability [4][5][6].
While ecosystem stability, biodiversity patterns and nestedness have been linked to patterns of biotic interactions [5,7,8], such direct links have not been demon-strated for association networks. Unlike many ecological networks, microbial association networks suffer from interpretational challenges as they cannot be observed directly [9]. Despite these drawbacks, clusters from microbial association networks have been shown to reflect important drivers of community composition [10,11]. However, traditional choices for network clustering algorithms are unable to make optimal use of information contained in edge signs. For example, the Markov Cluster Algorithm (MCL) uses random walks based on integers to identify network clusters [12]. While this can be achieved by scaling values or by adjusting the inflation parameter, the algorithm depends on edge density to infer clusters and is therefore mostly suitable for networks with a low number of negatively-weighted inter-cluster edges.
Alternatives have been developed that are able to take edge weight into account, such as the Louvain method for community detection [13] and the Kernighan-Lin bisection algorithm [14], an algorithm that optimizes separation of the network into two parts. An entirely different approach is to scale node weights such that negatively-weighted edges are either converted to positively-weighted edges or have lower positive weights; this approach is implemented in WGCNA [15], a pipeline for network inference and clustering. Although scaling approaches result in a loss of sign information, such approaches can also yield satisfactory results.
In this work, we describe manta (microbial association network clustering algorithm), a novel flow-based method for clustering of microbial networks that can recover biologically relevant clusters. Moreover, we include a robustness metric to quantify whether cluster assignments are robust to the high error rates found in microbial association networks. Our method represents an alternative to the popular flow-based MCL algorithm [12] that in contrast to MCL can take optimal advantage of edge signs and does not need parameter optimization.

manta equals or outperforms other algorithms on synthetic data sets
To evaluate performance of manta compared to alternative methods, we generated synthetic data sets using two different approaches. One is based on the generalized Lotka-Volterra (gLV) equation, while the other was developed for the evaluation of biclustering applied to gene expression data [16] (Fig. S2).
In the right circumstances, all algorithms achieve a separation around 0.5 when clustering the two-cluster network (Fig. 1). However, this requires the use of a positive-edge only subnetwork for the Louvain method, the WGCNA unsigned approach (data not shown) and the Girvan-Newman algorithm; on the complete network, these algorithms do not separate the nodes into two clusters or fail to separate the true-positive clusters. Shifting the edge weights failed to resolve this (Fig. S14). For MCL, separation depends on the parameter settings (Fig.   S3). This algorithm can cluster the complete network, but only if the inflation parameter is set to 3 or to another uneven value. In this simulation, MCL with default parameters was unable to recover the true-positive cluster assignment (data not shown).
Strikingly, performance is improved for the Louvain method but not for the Kernighan-Lin bisection when only positively-weighted edges are considered.
While the Louvain method takes edge sign into account during its optimization, negatively-weighted edges appear to have a negative effect on separation. Finally, the results for manta with weak assignments filtered suggests that the weak assignments increase precision even in noise-free data sets.
These environmentally motivated clusters may not necessarily reflect the true positive assignments due to species interactions amplifying or obfuscating environmental effects. Therefore, we also assessed performance on simulated biclusters generated with FABIA (Fig. 2).
The FABIA simulation has characteristics that prevent some algorithms from achieving good performance. Firstly, there is no underlying interaction network from which abundances are generated. This has a distinct effect on network topology; for example, the median approximated node connectivity of the gLV networks is 1, in contrast to 55 for the FABIA networks [17]. This implies that it is much harder to fragment the FABIA networks than the gLV networks.
As a result of this change in topology, no algorithm is able to generate cluster assignments that have sparsity scores close to 1. However, not all algorithms are equally affected by the change in topology. Especially affected are those networks that assume a modular or scale-free structure, such as the Louvain method and WGCNA. The reduced performance may be attributed to one of WGCNA's core assumptions: gene regulatory networks are assumed to be scale-free. Consequently, WGCNA infers a correlation network and soft-thresholds this network by choosing the matrix power such that scale-freeness is optimized. In the same vein, the Louvain method optimizes the modularity of cluster assignments and therefore assumes a modular structure within the data set.
The high node connectivity of the FABIA networks imply that no such modular or scale-free structure is present unless negatively-weighted edges are filtered; in that case, the Louvain method is able to return cluster assignments. Hence, approaches that imply the presence of structure are sensible for the simulation with an underlying interaction network, but may not be optimal for the FABIA biclusters. Neither manta nor the Kernighan-Lin method make such assumptions. As with the environmentally-motivated simulation, Kernighan-Lin bisection achieves some of the best results on this simulation. Hence, if users suspect based on a preliminary analysis that their data set contains only two clusters, this algorithm is likely to recover that separation regardless of the underlying structure. However, manta has the advantage of a tunable weak cluster assignment that can handle noisier data and less accurate networks and can handle data with more than 2 clusters. For MCL, parameter settings were not optimized on the FABIA data set. This dependency on parameter selection becomes apparent on data with a different structure, as the algorithm fails to recover clusters on positive-edge-only networks. and for networks generated from data with added multinomial noise (Fig. S13).

manta identifies biologically relevant groups in cheese rinds
We demonstrate the real-world applicability of manta on a cheese rind data set generated by Wolfe et al. [18]. In this study, the authors analyzed 137 cheese rind communities and identified important community members. Moreover, they found that community assembly of cheese rind communities was highly reproducible, despite the large geographical distances between cheeses. This can be explained at least partially by manipulation of the rind biofilm, as cheesemakers can introduce an initial community through starter cultures and then control the environment during the aging process. In their study on cheese rinds, Wolfe et al. [18] originally demonstrated that most of the community variation could be explained by the rind type. Indeed, most samples appear to cluster by rind type (Fig. 3B). The authors also found that samples from washed cheeses could cluster closely with both other types of rinds; the principal coordinate analysis also captures this phenomenon, as samples from washed cheeses are dispersed across the entirety of the axes.
We ran manta on the association network to assess whether cluster analysis would be able to recapitulate some of the drivers of community structure in the cheeses (Fig. 3A). The network visualization of the data reveals some interesting trends, as the network contains three clusters that correlate to moisture and taxonomy.
Cluster 1 is mostly comprised of Firmicutes and its summed abundances have a strong negative correlation to moisture. In fact, several of these taxa belong to the genus Staphylococcus, replicating the results by Wolfe et al. [18] as they demonstrated that Staphylococcus sp. are abundant on dry natural rinds. In contrast, cluster 1 mostly consists of Proteobacteria and correlates positively with moisture.
While the clusters correspond well to the results obtained by Wolfe et al. [18], manta is able to offer additional insight in community structure through its separation of Actinobacteria. Wolfe et al. [18] demonstrate that abundance of taxa belonging to this phylum is negatively correlated to moisture. However, the clusters indicate that this correlation is more nuanced. Some of the taxon abundances may reflect a gradual response to moisture rather than a strict preference for dry or moist cheese rinds, as summed abundances for Actinobacteria belonging to cluster 2 and 3 display a weak and non-significant correlation to moisture rather than the strong negative correlation associated to cluster 1.
On this data set, manta identifies clusters that correspond well with the main drivers of community composition in this study, while also identifying taxa that display intermediate responses to these drivers. Hence, this case study demonstrates that cluster analysis can yield novel insights into community structure.

Clustering global trends in coastal plankton communities
One advantage of manta is its ability to handle networks generated through any type of inference algorithm (though conversion to undirected networks is necessary in some cases). We demonstrate this through a time series analysis of coastal plankton communities [19]. Martin-Platero et al. [19] collected samples for 93 days and used this data to demonstrate that the communities changed rapidly, but only when lower taxonomic ranks were taken into account. The authors demonstrate WaveClust on this data set, a novel clustering method based on wavelet analysis of longitudinal data. Consequently, WaveClust can find taxon associations at low and high frequencies that are not visible without a frequency decomposition. The authors evaluated their technique on data from coastal plankton, where clusters identified by WaveClust corresponded to rapid growth of specific groups of taxa. A network analysis of the Granger causalities between clusters and metadata further demonstrated that the clusters could be separated into two regimes, both corresponding with an initial warm period followed by a rapid or gradual cooling period. These regimes are also visible in the ordination plot, as the samples separate over time around day 215 (Fig. 4B).
We generated a network with eLSA and clustered this with manta (Fig. 4A). manta is not designed for time series and therefore not able to detect frequencyspecific associations. However, manta does summarize larger trends in the data. To demonstrate this, we first computed a network with eLSA, a microbial network inference algorithm that takes temporal shifts into account [20,21]. A force-directed layout reveals that the eLSA network contains a large number of anti-correlated nodes. These clusters correlate with multiple environmental variables, thereby closely reflecting the two meta-regimes identified by [19]. The vector for abundances of taxa belonging to cluster 1 aligns closely with the chlorophyll concentration and wave height; in contrast, the total abundance of taxa belonging to cluster 0 was highest in the start of the experiment, when the seasonal warm period was still ongoing. The vector for taxa that could not confidently be assigned to a cluster was also significant and is directly opposite of the silicate concentration. The silicate concentration is not part of the originally reported Granger causality model [19], but it does increase sharply around the transition point between the two meta-regimes (days 240-260). Hence, the weak assignments capture a set of taxa that become less abundant during the transition period and do not correlate strongly to taxa in the assigned clusters.

Discussion
The manta algorithm is able to perform as well as or better than pre-existing algorithms available for network clustering, while its ability to identify weak cluster assignments can assist in defining cluster cores in networks that represent a gradient rather than distinctly separated clusters. Our case studies demonstrate how cluster assignments recapitulate main drivers of community composition.
A key limitation of manta is its inability to deal with networks with only a few negative edges. In such cases, cluster assignments mostly separate central nodes from peripheral nodes. However, manta was specifically developed for weighted networks; when only a few edges have negative weights, treating the network as unweighted or removing negatively-weighted edges will not affect network structure significantly. As demonstrated by the separation scores on positiveedge only networks, algorithms like MCL and the Louvain method will then perform adequately. Moreover, the clusters assigned by manta can accurately be described as 'the enemy of my enemy is my friend'; while there are algorithms available that require nodes to be similar to nodes within the community [22], manta has no such requirements and may assign nodes to a cluster even though they are not necessarily positively associated to other nodes within that cluster.
Additionally, users should be aware that cluster assignments will only reflect main drivers of community composition, as manta tends to generate a small number of clusters separated by weakly assigned nodes. Separating a data set by its main drivers (e.g. sample type or location) can help identify more interesting clusters.
Although we chose to evaluate manta in the context of microbial networks in this manuscript, manta may also be useful for clustering other types of networks with a large number of negative edges. For example, even though WGCNA was originally developed for gene expression data [15], it has also been used for microbial data [10]. The ability of manta to identify clusters without any underlying topological structure implies that it may be especially valuable in contexts where a small-world or scale-free structure cannot be expected. Moreover, its lack of sensitivity to parameter settings in this simulation demonstrates that manta is applicable in situations where little is known about the structure of the analyzed network ( Supplementary Material Fig. 6).

Clustering by graph traversal
We can represent a graph as an adjacency matrix, with each node in the graph represented as a row and column in the matrix. A non-zero entry in row i and column j represents an edge between node i and node j. In the case of an undirected graph, the adjacency matrix is symmetric. The adjacency matrix can be converted to a stochastic matrix by normalizing the values so each column sums to one. Raising the stochastic matrix A to a power n through a matrix product corresponds to the probabilities for random walks of length n to cross each position in the matrix. However, since this loses the sign information, we adopted a different approach that does not convert to a probability matrix.
Here, we define a scoring matrix, initialized from the weighted adjacency matrix.
In this case, each position in the matrix corresponds to the edge weights. Crucially, the matrix product of the weighted adjacency matrix can take on negative values, given an even n. The product can then be scaled by dividing all values by the largest absolute value in the matrix (Equation 1).
where • M is the scaled scoring matrix • A w is the weighted adjacency matrix • n is the walk length, by default 2 In the MCL algorithm, the separating effect of the matrix operation is boosted by raising each matrix entry to a power greater than one; afterwards, the matrix is rescaled again to acquire a column stochastic matrix (each column sums to 1) [12].
In contrast, we designed the algorithm preserve positive and negative values: every non-zero value m of scoring matrix M has its multiplicative inverse added to it (Equation 2). After inflation, the matrix is normalized again by dividing by the largest absolute value in the matrix. We also tested other variants, i.e.
raising each element to a power or taking the root; unlike the multiplicative inverse, no variant was able to capture the cluster structure. While this step may seem counter-intuitive as it reverses differences rather than reinforcing them, values converge to -1 and 1 each time for the toy model (Fig. 5B).
where • m is an element of scoring matrix M after inflation before normalization • l is the value of the element before inflation The stopping condition for the algorithm is defined as a threshold ε for the average of E (Equation 3). The matrix of fractional differences E is generated from the scoring matrices of the current and the previous iterations; zero values are filtered out to prevent issues with dividing by zero.
where • E is the error matrix where • s is the sparsity score While these flip-flop states represent rare cases when MCL is applied, the use of signed graphs by manta strongly increases the probability of these states appearing. For example, no convergence could be observed for larger matrices such as the simulated co-occurrence network (Fig. 5C).
This relates to the notion of balance in signed graphs [32],

Weak assignments and robustness
The subsetting strategy is complemented by additional information generated the product δ of the scaled edge weights is calculated. The average of these products is then the mean edge product (Equation 5).
where • P is the mean edge product • v is a node • t is an oscillator • n is the set of shortest paths from v to t • δ is the product of weights from one shortest path i The node is considered to have a weak assignment if P meets one of the following criteria: • The sign of P v,t is not positive for v assigned to the same cluster as t.
• P is smaller than a user-specified threshold.
The weak assignment does not check for nodes that belong to two clusters, but rather filters out nodes that could not be assigned to any cluster, given that the shortest paths were in conflict with the node's positions in the scoring matrix.
In addition to identification of weakly assigned nodes, we developed two new robustness measures to identify robust nodes and clusters (Fig. 5E). These measures are inspired by the reliability metric developed by Frantz and Carley [35] and highlight parts of the clustering outcome that are disproportionately sensitive to errors in network inference. Both measures are generated from rewired and are reported as a confidence interval of Jaccard similarity coefficients. For the cluster-wise robustness, Jaccard similarity scores are computed for the original clusters and their best permuted matches (identified through the maximum Jaccard similarity coefficient). This demonstrates how sensitive cluster compositions are to errors in network inference. In contrast, node-wise robustness does not use the best-matching clusters but computes Jaccard similarity coefficients for all clusters that contain the node in question.
These confidence intervals provide two different types of information. Firstly, a low Jaccard similarity coefficient indicates that a cluster has a different composition given a few errors, or that a node is assigned to a cluster with an entirely different composition given a few errors. Secondly, the width of the confidence interval demonstrates how variable cluster assignments are; wide intervals for node-wise robustness demonstrates that the node sometimes ends up in a similar cluster, but can also be assigned to a different cluster.

Synthetic data sets
We carried out two types of simulations with 50 replicates per simulation. For the first type, species interaction networks with a connectivity of 5% were generated with the R package seqtime (version 0.1.1) . The effects of environmental factors on growth rates were sampled from a normal distribution with μ=1. The strengths of these factors were sampled from a normal distribution with μ=3. To ensure that the cumulative effects of the environmental factors were unique to each condition, one factor was set to be positively weighted while the remaining two were converted to negative values. Data sets were then generated with the generalized Lotka-Volterra equation, with growth rates of each organism adjusted per environmental condition. As some interaction matrices caused population explosions, these were re-generated until enough matrices were available for which the generalized Lotka-Volterra equation could be solved. For details and source code on these data sets, we refer to [9]; note that in contrast to this work, we did not enforce a scale-free structure in the interaction network. We generated densely connected networks from the simulated species abundances with the Pearson correlation and filtered these for significance (α=0.05). If the absolute value of the largest growth rate for one environmental condition divided by the mean growth rate was larger than 0.5, the species was assigned a specific cluster identity. This ensures that cluster assignments are not punished for failing to assign species barely affected by the environmental factors.
The second type of simulation uses an adapted version of the FABIA R package [16], a package that generates simulated data for evaluations of biclustering algorithms. We adapted the makeFabiaDatablocksPos function to generate biclusters at set locations and used these locations to define true positive clusters.

Clustering algorithms
We compared manta to the Louvain method [13], MCL [12], WGCNA [15], the Girvan-Newman method [36] and Kernigan-Lin bisection [14]; for an overview of properties relevant to this manuscript, we refer to Table 1 and the Supplementary Material. We used the following implementations of the algorithms: • WGCNA: blockwiseModules function (version 1.66) [15] • 3-6). We set these parameters as follows: • manta: ratio set to 0.8, edgescale to 0.3 • MCL: On complete networks, inflation was set to 3 and expansion to 15.
On positive-edge-only networks, inflation was set to 2 and expansion to 7.
• Louvain method: On complete networks, resolution was set to 0.1. On positive-edge-only networks, resolution was set to 1.
Cluster assignments were evaluated as described by [37]. We report both the complex-wise sensitivity (Sn), the cluster-wise positive predictive value (PPV), geometrical accuracy (Acc) and the separation (Sep) (Fig. S1). We also report the sparsity score (Equation 4). When a cluster size exceeded 80% of the total number of assigned species, all measures except the sparsity score were replaced with missing values to avoid obfuscation of the results. The opposite case -when nearly all species were assigned to their own clusters -was also replaced when the number of assigned clusters exceeded 50.

Cheese rind case study
We downloaded BIOM-formatted bacterial abundances of this cheese data set (study ID 11488, [18]) from Qiita [38]. The data set was preprocessed by removing samples with fewer than 10000 counts, rarefying to even depth and filtering taxa with less than 20% prevalence. The final data set therefore included 97 taxa and 337 samples. We used these abundances to construct a network with CoNet; an initial multigraph was constructed with Pearson correlation, Spearman correlation, mutual information, Bray-Curtis dissimilarity and Kullback-Leibler dissimilarity. Only edges that were supported by at least 2 methods were retained.

Longitudinal coastal plankton case study
We downloaded the supplementary files from [19]. Taxon abundances were first agglomerated at genus level; afterwards, we removed samples with fewer than 2000 counts and rarefied to even depth. Taxa with a sample prevalence below 30% were removed, with the total counts preserved by including them in a bin. For an additional analysis of the clusters, a principal coordinate analysis of the metadata with Bray-Curtis dissimilarity was carried out. Technical replicates were averaged and the metadata features supplied by [19] were used to fit environmental vectors onto the ordination. Significance of these vectors was assessed through permutation testing (1000 permutations), and only vectors with a p-value below 0.05 (after Benjamini-Hochberg multiple testing correction) and qvalue below 0.05 were retained. To compare the direction of cluster abundances to these vectors, all bacterial abundances were summed and the covariance between these abundance vectors and principal component vectors estimated.
These covariance values significantly exceeded covariances of permuted bacterial abundances (1000 permutations, p-value 0.001) and were used to construct cluster abundance vectors; an arbitrary scaling coefficient was used for visualization purposes.

Data Availability
All code for manta is available under the Apache-2.0 license at https://github. com/ramellose/manta. An archived version of manta, together with code and synthetic data used for writing the manuscript, is available from https:// zenodo.org/record/3407044#.XXtTcmZx1EY.

Acknowledgements
We would like to acknowledge Didier Gonze for helpful feedback on the balance of signed matrices. This work was financially supported by KU Leuven.

Competing Interests
The authors declare no competing interests. Overview of different clustering algorithms. Different properties of manta, WGCNA, MCL, Louvain community detection, the Girvan-Newman algorithm and the Kernighan-Lin bisection algorithm. The following properties are summarized: how algorithms choose a cluster number, whether they can leave nodes unassigned, whether they perform better with negatively-weighted edges removed and what types of networks they accept. Finally, we assessed whether algorithms required extensive parameter tuning before achieving optimal performance on simulated data. Figure 1: Performance of network clustering tools on two environmentally motivated clusters. Clustering performance was estimated on 50 independently generated data sets generated from random interaction matrices. Sensitivity (Sn), positive predictive values (PPV), accuracy (Acc) and separation (Sep) were calculated as described by [37]. Sparsity of the assignment is a function of the edge weights of intra-cluster versus inter-cluster edges (Equation 4). The numbers next to the sensitivity results indicate how many cluster assignments met the following criteria for a particular algorithm: no cluster should exceed 80% of the total number of nodes, and there should be fewer than 50 clusters. The manta algorithm was run with and without weak assignments, while WGCNA was run with signed networks and a signed topological overlap matrix and with unsigned networks combined with the unsigned matrix. For all other algorithms, we provided the complete network in addition to the positive edge-only network (indicated with +).

Figure 2:
Performance of network clustering tools on two biclusters generated with FABIA [16]. Clustering performance was estimated on 50 independently generated data sets without an underlying topology. Sensitivity (Sn), positive predictive values (PPV), accuracy (Acc) and separation (Sep) were calculated as described by [37]. Sparsity of the assignment is a function of the edge weights of intra-cluster versus inter-cluster edges (Equation 4). The numbers next to the sensitivity results indicate how many cluster assignments met the following criteria for a particular algorithm: no cluster should exceed 80% of the total number of nodes, and there should be fewer than 50 clusters. The manta algorithm was run with and without weak assignments, while WGCNA was run with signed networks and a signed topological overlap matrix and with unsigned networks combined with the unsigned matrix. For all other algorithms, we provided the complete network in addition to the positive edge-only network (indicated with +).  Figure 3: Network analysis of a cheese data set [18]. A CoNet network clustered with manta. Cluster identity is encoded in node shape, whereas node colour represents phylum membership and node border width reflects whether manta assigned weak cluster membership. The edge colour is mapped to the sign of the association. B Principal Coordinate Analysis (PCoA) of Bray-Curtis dissimilarities for sample compositions.
The colours indicate the different cheese rind types: bloomy cheeses are inoculated with fungi, while washed cheeses are repeatedly washed with a brine solution. In contrast, natural rinds are not disturbed during aging. C Spearman correlation of moisture to summed taxon abundances. Correlations for the phylum Actinobacteria are highlighted with blue.  Figure 4: Network analysis of longitudinal 16S data collected from coastal plankton [19]. A eLSA network clustered with manta. Cluster identity is encoded in node shape, whereas node colour represents phylum membership. The edge colour is mapped to the local similarity score. B Principal Coordinate Analysis (PCoA) of Bray-Curtis dissimilarities for sample compositions, overlaid with environmental vectors and cluster abundance vectors. Significance of these vectors was assessed through permutation testing; only significant vectors are shown. Cluster abundance vectors were scaled independently from environmental vectors; abundances of taxa that could not be assigned to a cluster were included in the 'Cluster weak' vector. The axis values are the eigenvalues of the ordination axes. D manta uses agglomerative clustering on the scoring matrix to assign each node to a cluster. For flip-flopping matrices, the scoring matrix is generated from subsets of the complete network. E A fraction of the original network is rewired to generate permuted cluster assignments with identical degree distributions. Robustness of cluster assignments can then be estimated by comparing the Jaccard similarity of cluster memberships clusterwise or node-wise.