Coevolutionary Analysis Reveals a Conserved Dual Binding Interface between Extracytoplasmic Function σ Factors and Class I Anti-σ Factors

In the bacterial world, extracytoplasmic function σ factors (ECFs) are the most widespread family of alternative σ factors, mediating many cellular responses to environmental cues, such as stress. This work uses a computational approach to investigate how these σ factors interact with class I anti-σ factors—the most abundant regulators of ECF activity. By comprehensively classifying the anti-σs into phylogenetic groups and by comparing this phylogeny to the one of the cognate ECFs, the study shows how these protein families have coevolved to maintain their interaction over evolutionary time. These results shed light on the common contact residues that link ECFs and anti-σs in different phylogenetic families and set the basis for the rational design of anti-σs to specifically target certain ECFs. This will help to prevent the cross talk between heterologous ECF/anti-σ pairs, allowing their use as orthogonal regulators for the construction of genetic circuits in synthetic biology.

the 2 and 4 domains, while RsiW wraps around the two domains of SigW (Bacillus subtilis). Furthermore, while the three-helix bundle of some ASDIs require zinc coordination for ECF inhibition (ChrR Rsp [13]), another structure features a zinc-binding motif but binds the ECF independently of zinc (RsiW B. subtilis [14]), and others do not rely on a zinc-binding motif at all (RskA Mtu , RseA Eco [15,16]). Thus, it is tempting to speculate that the divergent binding topologies between ECFs and ASDIs could be important to prevent cross talk between ECF/ASDI pairs of different ECF groups and that these conformations might be conserved within ECF groups. If so, we reasoned that protein sequences of ECF and ASDI proteins have coevolved and that ASDI protein sequences should cluster into phylogenetic groups similar to the ECF groups.
Due to the diversity in ECF/ASDI quaternary structure, we here wondered whether there is a minimal contact interface conserved across all members of the ASDI family. To predict amino acid residues involved in such conserved contact interfaces, we turn to direct coupling analysis (DCA)-a bioinformatic method that exploits evolutionary covariation to predict contacting residues (17). When two residues interact, mutations in one need to be compensated by changes in the second so as to preserve the interaction (17). The same mechanism also applies for indirect contacts; however, DCA is able to distinguish direct from indirect interactions and considers only the former for the calculation of their covariation score (17). One of the highlights of DCA is that aside from stable conformations, it can also provide information on the transient, unstable conformations that occur during the dynamic process of interaction (18).
In this study, we set out to provide the first phylogenetic classification of ASDI proteins and reveal striking patterns of coevolution between these regulators and their cognate ECF factors. For the ECF/ASDI interaction, we used DCA to predict the residues that form the core ECF/ASDI contact. The arising sequence logos show divergent use of residues across ASDI groups, thus explaining the low binding affinity of noncognate ECF/ASDI pairs from different groups. However, the predicted interaction partners in the fourth helix of ASDI and their respective counterparts are less conserved even within the ASDI groups. This might explain how ASDI proteins maintain binding specificity even within ASDI groups. These results allow a first, in silico assessment of potential cross talk between two ECF/ASDI pairs without expensive in vivo testing, opening new ways to rationally design synthetic circuits using orthogonal ECF/ASDI pairs.

RESULTS
ASDI retrieval and classification. We focused on the class I anti-factors (ASDIs) as the main regulators of ECF factors, in order to gain a better understanding of their general binding mechanism for ECFs. Given that anti-factors are often coencoded with their ECF targets (1,2,13,19,20), we first set out to collect ASDIs from the genetic neighborhood of the ECF coding sequences. To this end, we focused on a set of 21,047 putative antifactors identified during a recent classification effort for ECF factors by our group (3). To identify ASDI-containing proteins from this data set, we used hidden Markov models (HMMs) developed from a small data set of both zinc-binding and non-zinc-binding ASDIs published earlier by Staroń and colleagues (2) (see Materials and Methods for details). This step yielded 7,490 proteins, showing that ϳ36% of all putative antifactors are ASDIs. In order to further expand the size of the ASDI sequence library, we built a new extended HMM from the ASDI domain of these sequences. We used this extended model to search for ASDIs in the genetic neighborhood of all classified ECFs identified in reference 3, using only the 33,843 ECFs from representative and reference organisms as labeled by the National Center for Biotechnology Information (NCBI). This yielded 11,939 proteins, from which we removed the ones with ASDIs shorter than 50 amino acids, since these could be divergent class II antifactors (21). The final number of ASDIs retrieved by this pipeline was 10,930, of which 10,806 have a nonredundant anti-domain. This shows that, on average, about one-third (ϳ32%) of the ECF factors contain a protein with an ASDI domain in their genetic neighborhood, suggesting that ASDIs are the most widespread regulators of ECF activity known to date. The average size of the ASDI domain was 101 Ϯ 33 (standard deviation) amino acids.
To gain an idea about the evolutionary history of these regulators, we classified them according to sequence similarity of their ASDI domain into 1,475 clusters of closely related sequences. For that, we used a divisive strategy, where the pool of sequences was subjected to a bisecting K-means clustering algorithm until the maximum k-tuple distance among sequences in the cluster was smaller than an empirical threshold of 0.6 (see Materials and Methods). Then, the consensus sequences of these clusters (referred to as subgroups here) were hierarchically clustered into a phylogenetic tree (Fig. 2). A simple inspection of the ASDI tree shows that neighboring ASDI sequences on the tree also regulate factors from the same ECF group (Fig. 2, ring 2), supporting the notion that ECFs and ASDIs coevolved. Given the similarity between ECF and ASDI classifications, we split the ASDI tree into monophyletic groups that regulate factors from the same ECF group (Fig. 2, ring 1). This split usually agrees with high bootstrap values ( Fig. 2; see also Fig. S1 in the supplemental material), suggesting that this definition of ASDI groups is robust to changes in the data set. As a result, ASDI groups were named with "AS" followed by a number dependent on the ECF group with which they are found in genomic proximity. Even though ASDIs from the same clade of the ASDI tree are usually coencoded with (and thus likely regulate) members of the same ECF group, some ASDI groups have slightly divergent features and are located in different clades of the ASDI tree. Two of these ASDI groups are, for example, AS19-1 and AS19-2, which regulate members of ECF19 (Fig. 2), but they are divergent in their ASDI helix 1 (consensus motif HTLAGAYALDAL in AS19-1 versus HLDPDQLALLA in AS19-2) and helix 2 (consensus motif LDDERAAFERHL in AS19-1 versus GEPLDADERAHL in AS19-2). Given that group AS19-2 is more closely related to AS27 than AS19-1, this suggests that these groups may have independently evolved the ability to bind to ECFs of group 19.
ASDIs that regulate ECFs from the same subgroup are usually located together in the tree but split into distinct ASDI subgroups (data not shown), probably due to the larger sequence diversity of antifactors than of ECFs. We observed that, even though there was a mixture of zinc-binding and non-zinc-binding ASDIs in the input data set (as indicated by the presence or absence of an "Hx 3 Cx 2 C" motif), both types distribute across the ASDI tree, generating ASDI groups that are mixtures of zinc-and non-zincbinding proteins, such as AS19-1 and AS27 (Fig. 2). Exceptions are groups AS33-1 and AS33-2, whose difference is the presence or absence of the zinc-binding domain, respectively (Fig. 2, ring 3).
Additionally, we predicted the mode number of transmembrane helices (TMHs) in the different ASDI subgroups using the consensus prediction from online TopCons (22). Most of the full-length anti-factor sequences (ϳ65%) are predicted to contain at least one TMH, suggesting that they are bound to the membrane, while the remaining ones (35%) are likely soluble antifactors. Although the whole data set of ASDIs is composed of similar amounts of zinc-(ϳ56%) and non-zinc-binding (ϳ44%) proteins, we observed that among the soluble ASDIs there was an overrepresentation (ϳ72%) of sequences with a zinc-binding motif. This is consistent with the notion that cytoplasmic ASDIs are often involved in sensing intracellular redox conditions (13,23,24). The membrane-bound anti-factors contained ϳ48% of sequences with a zinc-binding motif, contrasting with earlier observations that membrane-bound anti-factors showed an underrepresentation of zinc-binding domains (13). However, the data in this earlier work were based on a much smaller sequence data set of only 1,266 sequences (13), suggesting that this apparent bias may have been due to random sampling of the sequenced genomes at the time. Our finding of an approximately equal distribution of zinc-and non-zinc-binding motifs in the membrane-bound ASDIs indicates that the Zn-binding motif could be playing a nonsensing role, e.g., by taking a more static, structural function as is the case for RsiW in B. subtilis (14).
If the Zn-binding motif does not play an active sensory role, the general notion is that the ASDI domains have associated with additional protein domains that allow stimulus perception and ultimately trigger antifactor release (25,26). To assess the conservation of additional protein domains, usually located C terminal of the ASDI domain, we scanned full-length class I antifactors with Pfam 31.0 models (27) as well as the extended model of the ASDI domain. When indicating the positions of these domains in the different class I antifactor subgroups (Fig. 2, ring 4), we found that the protein domains associated with ASDIs are typically well conserved for ASDIs from the same group but differ between groups. This suggests that ASDIs regulating members of the same ECF group are likely sensing similar input cues, by binding directly either to the triggering molecule or to other sensory proteins. The full list of ASDIs, together with their partner ECF and their ASDI group and subgroup, can be found in Table S1.
Given the ample degree of correlation between ECF and ASDI classifications, we evaluated whether these families coevolved. For this, we calculated the Pearson correlation coefficient (PCC) of the pairwise distance matrices of ASDIs and ECFs, as described by Goh et al. (28), leading to a PCC of 0.82. In order to determine the significance of this correlation coefficient, we adopted the strategy of Dintner et al. (29) and included as negative controls RsbW-like anti-factors and RpoD-like factors, which do not interact with ECFs and ASDIs, respectively: RsbW is the anti-factor of the alternative factor B and a protein kinase of the anti-anti-factor RsbV in Bacillus subtilis (30), while RpoD is the housekeeping factor of Escherichia coli (31). For these negative controls, we obtained low PCCs (0.5 to 0.6), which are similar to the ones obtained by Dintner et al. for negative controls in bacterial two-component systems (29). This indicates that the PCC of 0.82 obtained for the correlation between ECFs and ASDIs is highly significant, showing that there has been strong coevolution between these protein families ( Table 1).
A taxonomic analysis of the ASDI protein family (Fig. S2A) further shows that ASDI groups are often composed of sequences originating from a single bacterial phylum, e.g., for AS02 (Proteobacteria), AS245 (Firmicutes), or AS12 (Actinobacteria), consistent with the observation that some ECF groups are phylum specific (3). While most other ASDI groups are also typically composed of sequences from a dominant phylum, they often contain a few ASDI subgroups from other phyla, e.g., for AS11, AS26, or AS243, suggesting that these subgroups may have resulted from horizontal gene transfer. Another interesting observation is that while all ASDIs from AS12 are found in Actinobacteria, the regulated ECFs make up only 89% of the sequences in the ECF12 group (3). The other ECF12s are found mostly in Bacteroidetes (ϳ7%) and Proteobacteria (ϳ2%), and the fact that they do not feature an ASDI in their genomic neighborhood indicates that they either are regulated by orphan ASDIs or have adopted another mode of regulating ECF activity (3). In fact, when more closely examining the phylum-specific frequency of ASDIs in the genomic neighborhood of ECFs (Fig. S2B), we found that in Bacteroidetes only 6% of ECFs are associated with ASDIs, which is significantly lower than the average of 32% found across all phyla. This is consistent with the observation by Staroń et al. (2), who noted that the Bacteroidetes group of RpoE-like proteins (ECF03) also lacks a conserved anti-factor in their genomic context, suggesting again either that gene synteny is broken or that other modes of ECF regulation, e.g., via other anti-classes, are dominant in this phylum. Other phyla, in contrast, feature a strong overabundance of ASDI-associated ECFs, such as the Gemmatimonadetes (ϳ79%) or the Chloroflexi (72%), but further studies are needed to identify the origin of this taxonomic bias.
DCA predicts two main contact interfaces between ASDIs and ECFs. Given the variability in the binding conformations in the four published ECF/ASDI cocrystal structures, we next wondered whether there exist universally conserved "core-binding interfaces" that are shared within the whole family of ASDI proteins, or whether the strong coevolution between the protein families gave rise to fundamentally different binding conformations. To identify potentially conserved contact interfaces, we sought to exploit the coevolutionary information between our ASDI data set (above) and the ECF classification (3). Specifically, we aimed at predicting amino acid residues on ASDIs and ECFs that display significant covariation, suggesting that they are in direct contact and that the mutation in one residue is balanced by a compensatory mutation in its binding partner. To this end, we applied direct coupling analysis (DCA) (17) to the full set of ASDIs and their cognate ECFs (Table S1). The results of this analysis revealed a large amount of high DCA scores within the 2 and 4 domains of the ECF factor and also connecting the two domains (Fig. 3A). This pattern matches previous DCA results in ECF factors (32) and is indicative of the conserved secondary and tertiary structure   (Table 2). (C) Scatterplot of the top 21 DCA predictions against the distance between the alpha carbons of the predicted contacts, as derived from the four structures of ECF/ASDI complexes (Fig. 1). The top 14 predictions are in close proximity in most of the three-dimensional structures. Complexes are labeled after their anti-factor, where RseA corresponds to RpoE/RseA complex from E. coli (PDB accession no. 1OR7 [16]), ChrR to SigE/ChrR from R. sphaeroides (PDB accession no. 2Q1Z [13]), RsiW to SigW/RsiW from B. subtilis (PDB accession no. 5WUQ [14]), and RskA to SigK/RskA from M. tuberculosis (PDB accession no. 4NQW [15]). (D) Multiple-sequence alignment of two selected ECF/ASDI pairs, RpoE/RseA from E. coli and SigK/RskA from M. tuberculosis. Labels of the top 14 contacts indicate their position. The presence of alpha-helices and their names are depicted on top of the alignment. The sequence logo depicts the amino acid composition of the full ECF and ASDI alignments derived from 10,930 sequences, respectively. (E) Three-dimensional depiction of the top 14 predictions in the structure of RpoE/RseA complex (PDB accession no. 1OR7 [16]). ECF is colored in beige, and anti-factor is in gray. Predicted contacts are labeled according to their rank. N and C termini from ECF and antifactor are labeled. on this family of proteins. We also observed high scores interconnecting helices 1, 2, and 3 of the ASDI, while helix 4 shows no strong DCA coupling scores with other parts of the ASDI domain (Fig. 3A). This agrees with the cocrystal structures of ECF/ASDI complexes ( Fig. 1), where helices 1, 2, and 3 form a helix bundle, which is connected to helix 4 by a flexible linker (13)(14)(15)(16). We then focused on the predictions that link ECFs and ASDIs since these are the ones responsible for ECF inhibition. At first glance, the contact map shows several high DCA scores linking the fourth helix of the ASDI with the 2 domain (Fig. 3A). Under closer inspection, the top 14 interprotein contact predictions (DCA score Ն0.255) are located in close proximity in most of the crystal structures ( Fig. 3B and C). Of those, 12 are connecting the 2 domain and helix 4 of the ASDI, and two (DCA#10 and #11) connect a single residue of helix 1 of the ASDI to two residues located in the 4 domain of the ECF (Fig. 3E). In the first case, the predicted contact area includes ECF regions 2.1 and 2.2 ( Fig. 3D), whose main function is binding to the clamp helices of the ␤= subunit of the RNAP (33)(34)(35). Thus, it is likely that binding of ASDI's helix 4 to this area prevents ECF binding to the RNAP core, hampering ECF-dependent transcription when the anti-factor is present. Instead, DCA predictions #10 and #11 involve ECF helices 4.2 and 4.4 ( Fig. 3D), in two residues involved in the contact with the Ϫ35 element of the promoter (33,36). Taken together, the presence of these strong coevolutionary signals suggests that the majority of the 10,860 ASDI proteins establish contact to the ECF via these two binding interfaces, connecting the ASDI with both the 2 and 4 domains. However, although the top 14 DCA predictions connect residues located in close 3-dimensional proximity in most of the four resolved cocrystal structures of ECFs/ASDIs, only six are direct contacts in the four crystal structures (Fig. 3B). While the other 8 "close hits" could merely be false-positive predictions, it is tempting to speculate that these residues might form close contacts in other ECF/ASDI groups, which might take slightly different binding conformations from those captured by the four structures solved to date. Alternatively, these close hits may form transient contacts during the initial recognition between ASDI and ECF. Another observation was that 19 direct contacts that are shared between the four ECF/ASDI cocrystal structures were not predicted by DCA (Table 2), suggesting either that DCA fails to predict them or that these contacts are less prevalent in the remainder of the ECF/ASDI protein families.
To obtain a better overview of the residues involved in the contact interfaces, we plotted the residues predicted by DCA-both in the ECF and in the ASDI-for the 12 largest ASDI groups with more than 100 sequences (Fig. 4). The resulting logos showed that contacts involving ASDI's helix 1 and the 4 domain (DCA#10 and #11) are generally conserved within groups but different between groups. Predictions DCA#10 and #11 feature two main types of contacts, either a charged or a hydrophobic interaction (Fig. 4). This pattern is most evident for prediction DCA#11, which tends to harbor a positive amino acid in the ECF (e.g., R178 in RpoE E.coli ) and a negative residue in the ASDI (e.g., D11 in RseA E.coli ), as found in groups ECF02, ECF12, ECF14, ECF27, ECF235, and ECF245. However, in some cases this is replaced by a hydrophobic contact, typically with leucine on both the ECF and ASDI (e.g., L177 in SigK and L18 in RskA from Mycobacterium tuberculosis), as found in groups ECF17, ECF18, and ECF19. In contrast to these clear-cut contact motifs predicted for helix 1, residues in helix 4 of the ASDI (all predictions except DCA#10 and #11) exhibit a weaker conservation even within most of the ASDI groups (Fig. 4). This has some exceptions, such as the prediction DCA#7, featuring a conserved contact between an aromatic residue (W or Y) on the ASDI and a proline (P) on the ECF side in groups ECF12, ECF14, ECF27, and ECF245 (Fig. 4). Together with the observation that helix 4 of the ASDI holds most of the DCA predictions (Fig. 3D), this suggests that helix 4 is in charge of further determining the specificity of the ASDIs, keeping them orthogonal from other ASDIs of the same group. Indeed, antifactors that regulate ECFs from the same group have been found to be mostly orthogonal (37).
Specificity-determining positions of ASDI groups coincide with the predicted binding interfaces. Next, we asked whether the ASDI residues predicted to be in contact with the ECF are also key residues that determine the distinction between ASDI groups. If this was the case, it would suggest that ASDI groups would be primarily distinguished by their interaction with their respective ECF. Alternatively, if ASDI groups were primarily determined by residues outside predicted contact interfaces, this would argue that interactions with potential ligands or intraprotein interactions determine protein subfamilies (38). The presence of such group-specific amino residues-so-called specificity determining positions (SDPs)-can be detected by S3det, a bioinformatic tool based in multiple correspondence analysis that finds residues associated with subfamilies of proteins (39). Using this tool, we predicted SDPs by comparing every pair of the 12 largest ASDI groups and taking only the highest-scoring SDP prediction of every ASDI group into further consideration (see Materials and Methods). As a result, we identified five SDPs, named by running numbers (SDP#1 to SDP#5) from N to C terminus: two in helix 1, one in helix 3, one in helix 4, and the last one exclusively present in group AS243 (Fig. 5A). Proteins from group AS26 did not hold any prediction, since they do not fit well into the multiple sequence alignment of the full ASDI data set-probably due to extensive differences at the sequence level (cf. Fig. 4). Similarly, AS243's SDP#5 corresponds almost exclusively to a gapped position in the alignment with the rest of the groups, as indicated by the absence (or very narrow representation in the case of AS245) of conserved residues for SPD#5 (Fig. 5B). These differences at the sequence level might reflect functional differences between standard ASDIs and ASDIs from groups 243 and 26. In favor of this hypothesis, one member of AS243, FecR from E. coli, is distinguished from ASDIs of other (non-AS243) groups in that its 59 N-terminal amino acids are essential for ECF activity (40). Probably due to these unique features,  4.4 ) according to the presence of alpha-helices. "H" indicates alpha-helix. The rank of the DCA prediction is displayed in the second-to-last column when the interaction is predicted by DCA. If the residue is an SDP in the ASD, it is indicated in the last column. "Y" for yes is highlighted by bold in the last column.
the observed misalignment between members of AS243 and the other ASDI groups precludes further interpretation of the SDPs in this group. In contrast, all other predicted SDPs (except SDP#5) are part of the contact interfaces with the ECF in the existing crystal structures (Fig. 5C). Conserved position D11 in RseA E.coli , predicted by DCA (Fig. 3B, DCA#10 and #11), was part of the predicted SDPs (Fig. 5A, SDP#2). Yet another SDP, S56 in helix 4 (Fig. 5A, SDP#4), was predicted by DCA (Fig. 3B, DCA#1   Predicting ECF/ASDI Interactions to the 4 domain, usually in its last helix (Fig. 5C). Interestingly, SDPs #1, #2, and #3 form a cluster of interactions with the same area of the ECF, which usually corresponds to the last helix of the 4 domain, except in the SigE/ChrR structure, where the contact appears before this area (Fig. 5C). Thus, besides some exceptions in groups AS26 and AS243, these results suggest that the main characteristic that discriminates between ASDI groups is their ability to interact with the factors within their cognate ECF groups.
Given that these residues are conserved within phylogenetic ASDI groups, face the ECF in the solved ECF/ASDI crystal structures, and feature different amino acids in different groups, it is likely that they take part in determining specificity toward the target ECF. This is supported by that the fact that most of these SDPs are also DCA predictions ( Table 2).

DISCUSSION
In this study, we used a computational approach to study how the class I antifactor family members interact with their cognate ECF factors. Based on the similarity between ECF and ASDI phylogenies, we showed that these protein families have coevolved-likely because they are in direct contact with each other-and exploited this coevolution to predict two conserved binding interfaces for the ASDI/ECF interaction. Although previous work provided insight into the cocrystal structures of individual ASDI/ECF pairs, the present work puts these case studies into a broader, evolutionary perspective, by providing the first phylogenetic classification of the class I antifactor protein family. Interestingly, within the resulting AS groups-solely defined by the sequence of their ASDI domain-we observed a striking conservation of the fused protein domains. Compared to early work by Campbell et al. (13), the explosion in sequenced genomes in recent years allowed us to expand the ASDI data set from 1,266 to more than 10,000 putative ECF/ASDI pairs from NCBI reference genomes, providing a more comprehensive and phylogenetically balanced overview of the diversity of these proteins. In agreement with the work of Campbell et al. (13), we found that about one-third (ϳ32%) of all ECFs are genomically associated with, and thus likely regulated by, ASDIs. Yet, our expanded ASDI library showed important differences compared to previous work in that (i) we find more ASDIs containing a zinc-binding motif (ϳ56% compared to ϳ38% [13]); (ii) we find more cytoplasmic anti-factors (ϳ35% compared to ϳ28% [13]); (iii) cytoplasmic anti-factors are still overrepresented in zinc-binding motifs, but to a smaller extent (ϳ72% of the soluble anti-factors are zinc binding in our data set compared to 92% in reference 13); and (iv) membrane-bound ASDIs are not underrepresented in zinc-binding motifs as suggested in reference 13, with about half of the proteins (ϳ48%) being zinc-binding anti-factors. These data suggest that ASDIs are more diverse than previously thought and argue against a functional role of the zinc-binding domain exclusively in soluble antifactors. This is supported by the ASDI phylogenetic tree (Fig. 2), where zinc-and non-zinc-binding ASDI groups are mixed across the tree and sometimes even within the same group, as in the case of AS27 and AS19-1. In these mixed zinc-and non-zinc-binding groups, this suggests that the zinc-binding motif may play a structural instead of a sensory role, as shown for RsiW from B. subtilis (group AS245) (14).
Our analysis of DCA predictions and SDPs show that there exists a conserved, dual binding interface, with ASDI's helix 1 binding to the 4 domain and ASDI's helix 4 binding to the 2 domain. These results agree with crystal structures of ECF/ASDI complexes (13)(14)(15)(16) and suggest that the contacts seen in these few examples are indeed realized across the full ECF/ASDI families. Further, our results suggest that ASDI's helix 2 is not critical for ECF binding but is important for ASDI tertiary structure. ASDI's helix 3, which is located between ECF's 2 and 4 domains in three out of four structures (13,15,16), harbors an SDP involved in the interaction with 4 domain, in similar residues as those contacted by the prediction on helix 1. This modularity of the ASDI interaction is reflected in the function of the ECF residues involved in the predictions. Contacted residues in regions 2.1 and 2.2 are mostly involved in the contact with the clamp helices of the ␤= subunit of the RNAP (33,35), whereas predicted contacts in 4 are part of the contact interface with the Ϫ35 element of the promoter (33,36). The analysis of the DCA predictions revealed a different degree of conservation across ASDI groups, with the residues that take part in contacts between ASDI's helix 1 and ECF's 4 (DCA predictions #10 and #11) being conserved for most of the ECF and ASDI phylogenetic groups. Interestingly, this area, which connects D11 on the ASDI to R149 and R178 on the ECF (RseA/RpoE E.coli coordinates), bears two main types of interactions, that is, hydrophobic, which usually features leucine in both ECF and ASDI (Fig. 4, groups AS17, AS18, and AS19-1), or charged, usually featuring arginine in the ECF side and aspartate in the ASDI side (Fig. 4, groups AS02, AS12, and AS14, among others). Random mutagenesis in RseA E.coli (group AS02) showed that a single amino acid mutation of D11 to histidine completely inhibits RseA E.coli activity (41), confirming the key role of this contact. Given their group-specific conservation and the striking polarity differences between the two binding types, we speculate that D11 defines coarse-grained specificity of ASDIs for ECFs of the same binding type, usually found in the same phylogenetic group. However, ASDIs are usually specific to their own target ECF and do not usually cross talk with members of the same group (37), indicating that there are more sources of specificity in residues that are not conserved in groups. One potential source of this specificity is the residues predicted by DCA in helix 4. These residues are generally not conserved within groups (Fig. 4) and bind the 2 domain in all the solved crystal structures of ASDI/ECF complexes (13)(14)(15)(16). This lack of major conservation is extended to the predicted contacts on the ECF side, which are generally in charge of binding to the ␤= subunit of the RNAP.
Generality of the dual binding interface in other /antiinteractions? Paget classified anti-factors into two types, the ones that insert between 2 and 4 (RseA, RskA, and ChrR) and the ones that wrap around these domains (RsiW) (42). Our data show that despite these differences in binding topology, both types of ASDIs contact the two main binding interfaces described here. Moreover, a similar binding mode can be observed in the crystal structures of the ECF CnrH in complex with the class II antifactor CnrY, from Cupriavidus metallidurans (43). The two alpha-helices of CnrY wrap around CnrH in a conformation where CnrY's first alpha-helix mimics the function of ASDI's first helix and binds to the 4 domain, and CnrY's second and last alpha-helix binds to the 2 domain in a similar manner as ASDI's fourth helix. The only crystal structure of a member of the ASDIII class of anti-factors, BldN, in complex with the ECF factor RsbN from Streptomyces venezuelae (12) also shows this dual binding mode. In this case, the first and second alpha-helices of BldN bind to the 4 domain, whereas its third and last alpha-helix binds to the regions 2.1 and 2.2 of a different RsbN molecule, similarly to ASDI's fourth helix (12). The similarity of the binding between the three types of ECF anti-factors is striking and contrasts with their low level of sequence similarity, which is limited to ϳ11% for RseA/BldN and ϳ3% for RseA/CnrY (using global pairwise alignments calculated by the Needleman-Wunsch algorithm implemented at EBI [44]). This explains why, even though the same regions of the anti-factor interact with a similar area of the ECF in the three types of ECF antifactors, the specific residues that carry out the interaction with the ECF may differ between ASD types. It is unclear why bacteria need at least three types of ASDs. On one hand, different ASDs may provide extra specificity to ECF inhibition, which could help to reduce the apparent tendency to cross talk of antifactors (45). On the other hand, the three types of ASDs could have emerged from different proteins and optimized their ECF inhibition by blocking the same ECF regions through convergent evolution. Future analysis that includes all the ASDs known to date could help in understanding their evolution.
Interestingly, dual binding interfaces between and anti-factor extend beyond ECF factors. For instance, in E. coli the anti-factor FliM of the class 3 factor FliA (containing a 3 domain) also targets 2 and 4 regions with two different areas of the protein (46). However, the FliM inhibitory contacts are inverted relative to ECF antifactors: FliM is composed of four alpha-helices, of which the first and second bind to the surface of the 2 domain, similarly to the fourth helix of ASDIs. In FliM, the third and fourth helices are the ones that bind to 4 (46), similarly to the first helix of ASDIs. Interestingly, FliM does not bind to FliA's 3 domain, strengthening the idea that the blockage of both 2 and 4 is the core of factor inhibition. Whether this is also the case for housekeeping s and their anti-s remains to be seen, as to date only the interaction between the anti-factor Rsd and a truncated version of RpoD (containing only the 4 domain) was studied in E. coli (47,48). Thus, even though the present analysis was restricted to the interaction between ASDIs and ECFs, we suggest that the dual inhibition of RNAP-and DNA-binding interfaces is likely a universal feature of other antifactors, preventing formation of nonfunctional trimeric complexes between /antifactors and RNAP or DNA.

MATERIALS AND METHODS
General bioinformatic tools. Generally, multiple sequence alignments (MSAs) were generated by Clustal Omega 1.2.3. with options -iter ϭ 2 and -max-guidetree-iterations ϭ 1 and manually curated (49). However, UPP (50) (default options) was used for alignments subjected to DCA or to S3det, since they require stable columns of equivalent residues with few gaps. Hidden Markov models (HMMs) were built using hmmbuild function and used for scanning libraries using hmmscan function, both from HMMER suite 3.1b2 (51) and both with default parameters. For the extraction of the amino acid residue interactions between ECF and ASDI from cocrystal structures, we used Voronoi tessellation as implemented in Voronota version 1.19 (52). Protein structures were visualized using UCSF Chimera version 1.10.2 (53).
ASDI extraction. ASDIs were extracted from the genetic neighborhood (Ϯ10 coding sequences) of a library of 46,293 ECF factors in their most recent classification (3). In order to minimize taxonomic bias, these ECFs were extracted from organisms tagged as representative or reference species by NCBI (https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/), using only RefSeq entries when both RefSeq and GenBank records are available for the same genome. To identify ASDI domain-containing proteins, we first used two HMMs, one built from the zinc-binding and another from the non-zinc-binding antifactors from the work of Staroń et al. (2). We selected the optimal bit score threshold for the retrieval of new ASDIs for each HMM by optimizing a receiver operator characteristic (ROC) curve using the function roc_courve from sklearn.metrics (54). Proteins that were used for the construction of each model were used as positive controls, and the remaining, non-ASDI anti-factors from the work of Staroń et al. (2) were used as negative controls. The resulting bit score thresholds, 0.4 for non-zinc-binding and 14.2 for zinc-binding models, were applied for the extraction of ASDIs from the set of putative anti-factors from reference 3. This resulted in 7,490 ASDIs, which were subsequently used for the construction of an extended HMM of the ASDI family. The thresholding bit score that best separates real ASDIs from other proteins was optimized using a ROC curve as described above, resulting in a bit score threshold of 0.2. We used the extended HMM to look for further members of the ASDI family in the genetic neighborhood of ECFs (Ϯ10 coding sequences) from reference 3. In order to lessen the bias toward frequently sequenced organisms, we included only proteins from representative or reference genomes as labeled by NCBI (https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/), using only RefSeq entries when both RefSeq and GenBank records are available for the same genome. This yielded 11,939 putative ASDIcontaining proteins. We further curated these data, removing proteins with anti-domains shorter than 50 amino acids, since these could be anti-factors of class II (21). The area of the ASDI was defined as the envelope region of the highest-scoring hit of the extended HMM, discarding areas that are part of the transmembrane helices or extracellular. This resulted in 10,930 ASDIs, with an average length of 101 Ϯ 33 (standard deviation) amino acids.
Clustering of ASDIs. We clustered ASDIs according to amino acid sequence similarity. Given the large number of proteins, we first grouped them into clusters or closely related sequences, the so-called subgroups. These were built with a divisive strategy, where proteins were subjected to a bisecting K-means clustering approach until the maximum k-tuple distance between any protein of the cluster was smaller than 0.6, as measured by Clustal Omega with -distmat-out -full and -full-iter flags (49,55).
Bisecting K-means was implemented using KMeans function from the sklearn.cluster module (54). The 3,790 proteins that did not enter into any subgroup were left ungrouped. Thanks to this grouping, it was easier to see subgroups that may contain outliers that passed the HMM threshold but do not likely display antifactor activity. In order to distinguish and discard these outliers from our clustering, we assessed the presence of Pfam domains (Pfam 31.0 [27]) in the anti-factors from each subgroup. We discarded 132 subgroups (606 proteins) where the Pfam domains indicated an unlikely anti-factor function (data not shown). In summary, the resulting 1,475 subgroups defined during this process contained 6,534 proteins (ϳ60% of the starting ASDIs), with a median group size of 3 proteins and a standard deviation of 6.17 proteins. Given the low size of proteins in each subgroup, we further clustered the manually curated alignment of the consensus sequences of each subgroup into a maximum-likelihood phylogenetic tree using IQ-TREE version 1.5.5 (56) with 1,000 ultrafast bootstraps. As an outgroup of this tree, we included the antifactor class II CnrY, from Cupriavidus metallidurans. The resulting tree was visualized in iTOL (57) and split into monophyletic ASDI groups according to the ECF group of their cognate partner. With this strategy, we defined 23 ASDI groups, of which 12 contain more than 100 proteins.
The presence of a zinc-binding domain was assumed in ASDIs with a Hx 3 Cx 2 C sequence signature that expands over helix 2 and helix 3. Presence of transmembrane helices was assessed using the consensus prediction from online TopCons (22). The mode number of transmembrane helices was considered in order to plot the transmembrane helices for a whole subgroup of class I antifactors. In this way we avoid biases caused by the extremely large number of transmembrane helices in long, divergent proteins. The position of these helices for plotting was calculated according to the average start and end positions over the anti-factors in a subgroup. Similarly, the position of the ASDI domain across anti-factors from the same subgroup was calculated according to the average start and end positions of the envelope region of the lowest E value match to the extended HMM of the ASDI family, using hmmscan for HMMER suite 3.1b2 (51). The presence of other Pfam domains in full-length class I antifactors was evaluated using hmmscan function from HMMER suite 3.1b2 (51) with the library of HMMs from Pfam 31.0 (27). Pfam domains present in certain position of the MSA of the full-length antifactors in more than 50% of the members of a subgroup were plotted in the ASDI tree.
ASDI/ECF coevolution. In order to evaluate the coevolution of ECFs and ASDIs, we calculated the Pearson correlation coefficient (PCC) of the distances between cognate pairs of proteins, as introduced by Goh et al. (28). The significance of this PCC was evaluated similarly to reference 29. For this purpose, the PCCs between ASDIs and ECFs and of two extra families of proteins that did not coevolve and/or interact with ECFs or ASDIs were evaluated as negative controls. In our case, these negative controls were homologs of the E. coli housekeeping factor 70 (RefSeq accession no. NP_417539.1) and of the Bacillus subtilis antifactor RsbW (RefSeq accession no. WP_061902497), since proteins for these types have never been described to interact with ASDIs or ECFs, respectively. We extracted proteins from these types using online HMMER (51) with parameters -E 1 -domE 1 -incE 0.01 -incdomE 0.03 -mx BLOSUM62 -pextend 0.4 -popen 0.02 -seqdb uniprotrefprot and mapped the hit identifiers (IDs) from UniProt to GenBank using the UniProt's ID conversion tool (58). A total of 409 genomes contained the four protein families; these are ECFs, ASDIs, RsbW, and RpoD. For each organism, we selected one of the ECF-AS factor pairs and one homolog of RsbW and RpoD. These proteins had a taxonomically diverse origin, with 39% of the proteins from Firmicutes, 28% from Actinobacteria, 11% from Cyanobacteria, and the rest from eight other bacterial phyla. We calculated the pairwise distance for each protein family using Clustal Omega with -full and -distmat-out flags (49). The PCC was calculated from the flattened distance matrices using pearsonr function from Python's scipy.stats resource (59).
SDPs. Specificity determining positions (SDPs) were calculated with S3det (39) on the 12 ASDI groups with more than 100 proteins and on their cognate ECFs. Aligned ASDI (or ECF) proteins were extracted from the MSA used for DCA so as to preserve the same positional mapping. S3det was executed on every pair of ASDI (or ECF) groups, resulting in a set of ranked SDP predictions for every pair of groups. We scored the SDPs associated with every group as the sum of the inverse of their ranks across the different S3det runs with contribution of the group. The highest-scoring SDP for every group was considered positive, resulting in five SDPs.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.