Co-evolutionary analysis reveals a conserved dual binding interface between extracytoplasmic function (ECF) σ factors and class I anti-σ factors

Extracytoplasmic function σ factors (ECFs) belong to the most abundant signal transduction mechanisms in bacteria. Amongst the diverse regulators of ECF activity, class I anti-σ factors are the most important signal transducers in response to internal and external stress conditions. Despite the conserved secondary structure of the class I anti-σ factor domain (ASDI) that binds and inhibits the ECF under non-inducing conditions, the binding interface between ECFs and ASDIs is surprisingly variable between the published co-crystal structures. In this work, we provide a comprehensive computational analysis of the ASDI protein family and study the different contact themes between ECFs and ASDIs. To this end, we harness the co-evolution of these diverse protein families and predict covarying amino acid residues as likely candidates of an interaction interface. As a result, we find two common binding interfaces linking the first α-helix of the ASDI to the DNA binding region in the σ4 domain of the ECF, and the fourth α-helix of the ASDI to the RNA polymerase (RNAP) binding region of the σ2 domain. The conservation of these two binding interfaces contrasts with the apparent quaternary structure diversity of the ECF/ASDI complexes, partially explaining the high specificity between cognate ECF and ASDI pairs. Furthermore, we suggest that the dual inhibition of RNAP- and DNA-binding interfaces is likely a universal feature of other ECF anti-σ factors, preventing the formation of non-functional trimeric complexes between σ/anti-σ factors and RNAP or DNA.

Significance

In the bacterial world, extracytoplasmic function σ factors (ECFs) are the most widespread family of alternative σ factors, mediating many cellular responses to environmental cues, such as stress. This work uses a computational approach to investigate how these σ factors interact with class I anti-σ factors, the most abundant regulators of ECF activity.
By comprehensively classifying the anti-σs into phylogenetic groups and by comparing this phylogeny to the one of the cognate ECFs, the study shows how these protein families have co-evolved to maintain their interaction over evolutionary time. These results shed light on the common contact residues that link ECFs and anti-σs in different phylogenetic families and set the basis for the rational design of anti-σs to specifically target certain ECFs. This will help to prevent the cross-talk between heterologous ECF/anti-σ pairs, allowing their use as orthogonal regulators for the construction of genetic circuits in synthetic biology.

Introduction

[...] predictions (Fig. 3D), this suggests that helix 4 is in charge of further determining the specificity of the ASDI, keeping them orthogonal from other ASDIs of the same group. Indeed, anti-σ factors that regulate ECFs from the same group have been found to be mostly orthogonal (37).

Specificity-determining positions of ASDI groups coincide with the predicted binding interfaces

Next, we asked whether the ASDI residues predicted to be in contact with the ECF are also key residues that determine the distinction between ASDI groups. If this were the case, it would suggest that ASDI groups are primarily distinguished by their interaction with their respective ECF. Alternatively, if ASDI groups were primarily determined by residues outside the predicted contact interfaces, this would argue that interactions with potential ligands or intra-protein interactions determine protein subfamilies (38). The presence of such group-specific amino acid residues, so-called Specificity Determining Positions (SDPs), can be detected by S3det, a bioinformatic tool based on multiple correspondence analysis that finds residues associated with subfamilies of proteins (39). Using this tool, we predicted SDPs by comparing every pair of the 12 largest ASDI groups and taking only the highest-scoring SDP prediction of every ASDI group into further consideration (see Methods). As a result, we identified five SDPs, named by running numbers (SDP#1 to SDP#5) from N- to C-terminus: two in helix 1, one in helix 3, one in helix 4 and the last one exclusively present in group AS243 (Fig. 5A). Proteins from group AS26 did not hold any prediction, since they do not fit well into the multiple sequence alignment of the full ASDI dataset, probably due to extensive differences at the sequence level (cf. Fig. 4).
Similarly, AS243's SDP corresponds almost exclusively to a gapped position in the alignment with the rest of the groups (Fig. 5C #5). These differences at sequence level might reflect functional differences between standard ASDIs and ASDIs from groups 243 and 26. In favor of this hypothesis, one member of AS243, FecR from E. coli, is distinguished from other non-AS243 ASDIs in that its 59 N-terminal amino acids are required for ECF activity (40). Interestingly, all predicted SDPs are part of the contact interfaces with the ECF in the existing crystal structures (Fig. 5B, Fig. S2). Conserved position D11 in E. coli RseA, predicted by DCA (Fig. 3B, #10 and #11), was part of the predicted SDPs (Fig. 5A, SDP#2). Yet another SDP, V27 in helix 4 (Fig. 5A, SDP#4), was predicted by DCA (Fig. 3B, #1 and #5). Predictions SDP#1 and SDP#3 connect S7 in helix 1 and Y36 in helix 3 in E. coli RseA to the σ4 domain, usually in its last helix (Fig. 5B, Fig. S2). Interestingly, SDPs #1, 2 and 3 form a cluster of interactions with the same area of the ECF, which usually corresponds to the last helix of the σ4 domain, except in the SigE/ChrR structure, where the contact appears before this area (Fig. 5B, Fig. S2). Thus, besides some exceptions in groups AS26 and AS243, these results suggest that the main characteristic that discriminates between ASDI groups is their ability to interact with the σ factors within their cognate ECF groups.

Given that these residues are conserved within phylogenetic ASDI groups, face the ECF in the solved ECF/ASDI crystal structures and feature different amino acids in different groups, it is likely that they take part in determining specificity towards the target ECF. This is supported by the fact that most of these SDPs are also DCA predictions (Table 2).

In this study, we used a computational approach to study how class I anti-σ factors interact with their cognate ECF σ factors. Based on the similarity between ECF and ASDI phylogenies, we showed that these protein families have co-evolved, likely because they are in direct contact with each other, and exploited this co-evolution to predict two conserved binding interfaces for the ASDI/ECF interaction. Although previous work provided insight into the co-crystal structures of individual ASDI/ECF pairs, the present work puts these case studies into a broader, evolutionary perspective by providing the first phylogenetic classification of the class I anti-σ factor protein family. Interestingly, within the resulting AS groups, defined solely by the sequence of their ASDI domain, we observed a striking conservation of the fused protein domains. Compared to early work by Campbell and colleagues (11), the explosion in sequenced genomes in recent years allowed us to expand the ASDI dataset from 1266 to more than 10,000 putative ECF/ASDI pairs from NCBI reference genomes, providing a more comprehensive and phylogenetically balanced overview of the diversity of these proteins. In agreement with Campbell et al., we found that about one third (~32%) of all ECFs are genomically associated with, and thus likely regulated by, ASDIs.
Yet, our expanded ASDI library showed important differences compared to previous work in that (i) we find more ASDIs containing a zinc-binding motif (~56% compared to ~38% (11)), (ii) we find more cytoplasmic anti-σ factors (~35% compared to ~28% (11)), (iii) cytoplasmic anti-σ factors are still overrepresented in zinc-binding motifs, but to a smaller extent (~72% of the soluble anti-σ factors are zinc-binding in our dataset compared to 92% in (11)), and (iv) membrane-bound ASDIs are not underrepresented in zinc-binding motifs as suggested in (11), with about half of the proteins (~48%) being zinc-binding anti-σ factors. These data suggest that ASDIs are more diverse than previously thought, and argue against a functional role of the zinc-binding domain exclusively in soluble anti-σ factors. This is supported by the ASDI phylogenetic tree (Fig. 2), where zinc and non-zinc binding ASDI groups are mixed across the tree and sometimes even within the same group, as in the case of AS27 and AS19-1. In these mixed zinc and non-zinc binding groups, this suggests that the zinc-binding motif may play a structural instead of a sensory role, as shown for RsiW from B. subtilis (group AS245) (12).

Our analysis of DCA predictions and SDPs shows that there exists a conserved, dual binding interface, with ASDI's helix 1 binding to the σ4 domain and ASDI's helix 4 binding to the σ2 domain. These results agree with crystal structures of ECF/ASDI complexes (11-14) and suggest that the contacts seen in these few examples are indeed realized across the full ECF/ASDI families. Further, our results suggest that ASDI's helix 2 is not critical for ECF binding but is important for ASDI tertiary structure.
ASDI's helix 3, which is located between ECF's σ2 and σ4 domains in three out of four structures (11, 13, 14), harbors an SDP involved in the interaction with the σ4 domain, at residues similar to those contacted by the prediction on helix 1. This modularity of the ASDI interaction is reflected in the function of the ECF residues involved in the predictions. Contacted residues in regions 2.1 and 2.2 are mostly involved in the contact with the clamp helices of the β' subunit of the RNAP (33, 35), whereas predicted contacts in σ4 are part of the contact interface with the -35 element of the promoter (33, 36).
The analysis of the DCA predictions revealed different degrees of conservation across ASDI groups, with the residues that take part in contacts between ASDI's helix 1 and ECF's σ4 (DCA predictions #10 and #11) being conserved for most of the ECF and ASDI phylogenetic groups. Interestingly, this area, which connects D11 on the ASDI to R149 and R178 on the ECF (RseA/RpoE E. coli coordinates), bears two main types of interactions: hydrophobic, which usually features leucine in both ECF and ASDI (Fig. 4, groups AS17, AS18 and AS19-1), or charged, usually featuring arginine on the ECF side and aspartate on the ASDI side (Fig. 4, groups AS02, AS12 and AS14, among others). Random mutagenesis in E. coli RseA (group AS02) showed that a single amino acid mutation of D11 to histidine completely inhibits E. coli RseA activity (41), confirming the key role of this contact. Given their group-specific conservation and the striking polarity differences between the two binding types, we speculate that D11 defines coarse-grained specificity of ASDIs for ECFs of the same binding type, usually found in the same phylogenetic group. However, ASDIs are usually specific to their own target ECF and do not usually crosstalk with members of the same group (37), indicating that there are more sources of specificity in residues that are not conserved within groups. One potential source of this specificity is the set of residues predicted by DCA in helix 4. These residues are generally not conserved within groups (Fig. 4) and bind the σ2 domain in all the solved crystal structures of ASDI/ECF complexes (11-14). This lack of major conservation extends to the predicted contacts on the ECF side, which are generally in charge of binding to the β' subunit of the RNAP.

Generality of the dual binding interface in other σ/anti-σ interactions?
Paget classified anti-σ factors into two types: those that insert between σ2 and σ4 (RseA, RskA and ChrR) and those that wrap around these domains (RsiW) (42). Our data show that despite these differences in binding topology, both types of ASDIs contact the two main binding interfaces described here. Moreover, a similar binding mode can be observed in the crystal structure of the ECF CnrH in complex with the class II anti-σ factor CnrY from Cupriavidus metallidurans (43). The two α helices of CnrY wrap around CnrH in a conformation where CnrY's first α helix mimics the function of ASDI's first helix and binds to the σ4 domain, and CnrY's second and last α helix binds to the σ2 domain in a similar manner as ASDI's fourth helix. The only crystal structure of a member of the ASDIII class of anti-σ factors, RsbN, in complex with the ECF σ factor BldN from Streptomyces venezuelae (44) also shows this dual binding mode. In this case, the first and second α helices of RsbN bind to the σ4 domain, whereas its third and last α helix binds to regions 2.1 and 2.2 of a different BldN molecule, similarly to ASDI's fourth helix (44). The similarity of the binding between the three types of ECF anti-σ factors is striking and contrasts with their low level of sequence similarity, which is limited to ~11% for RseA/RsbN and ~3% for RseA/CnrY (using global pairwise alignments calculated by the Needleman-Wunsch algorithm implemented at EBI (45)). This explains why, even though the same regions of the anti-σ factor interact with a similar area of the ECF in the three types of ECF anti-σ factors, the specific residues that carry out the interaction with the ECF may differ between ASD types. It is unclear why bacteria need at least three types of ASDs.
On one hand, different ASDs may provide extra specificity to ECF inhibition, which could help to reduce the apparent tendency to cross-talk of [...] proteins and optimized their ECF inhibition by blocking the same ECF regions through convergent evolution. Future analyses that include all the ASDs known to date could help in understanding their evolution.

Interestingly, dual binding interfaces between σ and anti-σ factors extend beyond ECF σ factors. For instance, in E. coli the anti-σ factor FlgM of the class 3 σ factor FliA (containing a σ3 domain) also targets σ2 and σ4 regions with two different areas of the protein (47). However, the FlgM inhibitory contacts are inverted relative to ECF anti-σ factors: FlgM is composed of four α helices, of which the first and second bind to the surface of the σ2 domain, similarly to the fourth helix of ASDIs. In FlgM, the third and fourth helices are the ones that bind to σ4 (47), similarly to the first helix of ASDIs. Interestingly, FlgM does not bind to FliA's σ3 domain, strengthening the idea that the blockage of both σ2 and σ4 is the core of σ factor inhibition. Whether this is also the case for housekeeping σs and their anti-σs remains to be seen, as to date only the interaction between the anti-σ factor Rsd and a [...]

[...] anti-σ factors from (3). This resulted in 7,490 ASDIs, which were subsequently used for the construction of an extended HMM of the ASDI family. The thresholding bit-score that best separates real ASDIs from other proteins was optimized using a ROC curve as described above, resulting in a bit-score threshold of 0.2. We used the extended HMM to look for further members of the ASDI family in the genetic neighborhood of ECFs (±10 coding sequences) from (3).
In order to lessen the bias towards frequently sequenced organisms, we only included proteins from representative or reference genomes as labelled by NCBI (https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/), using only RefSeq entries when both RefSeq and GenBank records were available for the same genome. This yielded 11,939 putative ASDI-containing proteins. We further curated these data by removing proteins with anti-σ domains shorter than 50 amino acids, since these could be anti-σ factors of class II (19). The area of the anti-σ domain was defined as the envelope region of the highest-scoring hit of the extended HMM, discarding areas that are part of the transmembrane helices or extracellular. This resulted in 10,930 ASDIs, with an average length of 101 ± 33 (standard deviation) amino acids.
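
The bit-score thresholding and length curation steps described above can be sketched as follows. This is a minimal illustration, not the pipeline used in this work: the criterion for reading the cutoff off the ROC curve is not stated in this section, so maximizing Youden's J (true positive rate minus false positive rate) is an assumption, and the function names and toy data are hypothetical.

```python
def best_bitscore_threshold(scores, is_asdi):
    """Return (cutoff, J) for the cutoff maximizing Youden's J = TPR - FPR.

    Assumption: Youden's J is used as the ROC criterion; a hit is called
    positive when its bit score is >= cutoff.
    """
    n_pos = sum(is_asdi)
    n_neg = len(is_asdi) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both true ASDIs and non-ASDIs")
    best_cutoff, best_j = None, float("-inf")
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, is_asdi) if s >= cutoff and y)
        fp = sum(1 for s, y in zip(scores, is_asdi) if s >= cutoff and not y)
        j = tp / n_pos - fp / n_neg
        if j > best_j:  # ties keep the lowest (most permissive) cutoff
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


def keep_long_domains(domains, min_len=50):
    """Drop anti-sigma domains shorter than min_len residues."""
    return [d for d in domains if len(d) >= min_len]


# Toy data (hypothetical): bit scores of HMM hits and whether each is a real ASDI.
scores = [0.05, 0.10, 0.15, 0.30, 0.50, 0.90]
is_asdi = [False, False, False, True, True, True]
cutoff, j = best_bitscore_threshold(scores, is_asdi)

# Toy domains of 49, 50 and 101 residues; only the last two pass the filter.
kept = keep_long_domains(["A" * 49, "A" * 50, "A" * 101])
```

On real data the two classes overlap, so the choice of criterion (Youden's J, a fixed false positive rate, or another rule) can shift the resulting cutoff.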

Clustering of ASDIs
We clustered ASDIs according to amino acid sequence similarity. Given the large number of proteins, we first grouped them into clusters of closely related sequences, the so-called subgroups. These were built with a divisive strategy, where proteins were subjected to a bisecting K-means clustering approach until the maximum k-tuple distance between any two proteins of a cluster was smaller than 0.6, as measured by Clustal Omega with the --distmat-out, --full and --full-iter flags (50, 56). Bisecting K-means was implemented using the KMeans function from the sklearn.cluster module (55). The 3,790 proteins that did not enter into any subgroup were left ungrouped. Thanks to this grouping, it was easier to spot subgroups that may contain outliers that passed the HMM threshold but are unlikely to display anti-σ factor activity. In order to distinguish and discard these outliers from our clustering, we assessed the presence of Pfam domains (Pfam 31.0 (26)) in the anti-σ factors from each subgroup. We discarded 132 subgroups (606 proteins) where the Pfam domains indicated an unlikely anti-σ factor function (data not shown). In summary, the resulting 1,475 subgroups defined during this process contained 6,534 proteins (~60% of the starting ASDIs), with a median group size of 3 proteins and a standard deviation of 6.17 proteins. Given the small number of proteins in each subgroup, we further clustered the manually curated alignment of the consensus sequences of each subgroup, into a maximum-[...] outgroup of this tree, we included the anti-σ factor class II CnrY, from Cupriavidus metallidurans. The resulting tree was visualized in iTOL (58) and split into monophyletic ASDI groups according to the ECF group of their cognate partner. With this strategy we defined 23 ASDI groups, of which 12 contain more than 100 proteins.
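
The divisive subgrouping described above can be illustrated with a short sketch. It makes several assumptions that are not taken from the text: the feature vector of each protein is its row of pairwise distances to the other members of the cluster being split, a small hand-written 2-means routine stands in for the sklearn KMeans call so that the example has no dependencies, and singleton clusters are treated as the ungrouped proteins. All function names and the toy distance matrix are hypothetical.

```python
def two_means(vectors, n_iter=20):
    """Deterministic 2-means (Lloyd's algorithm); stand-in for sklearn KMeans."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Seeds: the first vector and the vector farthest from it.
    centers = [vectors[0], max(vectors, key=lambda v: d2(v, vectors[0]))]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        labels = [0 if d2(v, centers[0]) <= d2(v, centers[1]) else 1
                  for v in vectors]
        for k in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def bisect_until_tight(dist, max_dist=0.6):
    """Split clusters in two until every within-cluster distance is < max_dist."""
    done, todo = [], [list(range(len(dist)))]
    while todo:
        idx = todo.pop()
        diameter = max(dist[i][j] for i in idx for j in idx)
        if len(idx) == 1 or diameter < max_dist:
            done.append(sorted(idx))
            continue
        # Assumed features: distances to the members of the current cluster.
        feats = [[dist[i][j] for j in idx] for i in idx]
        labels = two_means(feats)
        left = [i for i, lab in zip(idx, labels) if lab == 0]
        right = [i for i, lab in zip(idx, labels) if lab == 1]
        if not left or not right:  # degenerate split: peel off one member
            left, right = idx[:1], idx[1:]
        todo += [left, right]
    return sorted(done)


# Toy distance matrix (hypothetical): two tight pairs that are far apart.
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
clusters = bisect_until_tight(dist)
```

Stopping on the cluster diameter rather than on a fixed number of clusters means the number of subgroups is set by the data, which matches the description of many small subgroups plus a large set of ungrouped proteins.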
The presence of a zinc-binding domain was assumed in ASDIs with an Hx3Cx2C sequence signature that extends over helix 2 and helix 3. [...]

The significance of this PCC was evaluated similarly to (28). For this purpose, the PCCs between ASDIs, ECFs and two extra families of proteins that did not co-evolve and/or interact with ECFs or ASDIs were evaluated as negative controls. In our case, these negative controls were homologs of E. coli's housekeeping σ factor σ70 (RefSeq: NP_417539.1) and of Bacillus subtilis' anti-σ factor RsbW (RefSeq: WP_061902497), since proteins of these types have never been described to interact with ASDIs or ECFs, respectively. We extracted proteins of these types using online HMMER (52) [...]

DCA was applied to the 10,930 putative ASDIs extracted during this work (Table S1). ASDIs and their cognate ECF partners were aligned independently using UPP (51) with default parameters, and the resulting alignments were concatenated. Gaussian DCA with default parameters (61) [...]

[...] indicates the DCA predictions that are also common contacts observed in the four crystal structures of ECFs/ASDIs, as derived by Voronoi tessellation (see Table 2). C: Scatter plot of the top 21 DCA predictions against the distance between the alpha carbons of the predicted contacts, as derived from the four structures of ECF/ASDI complexes (Fig. 1)