ABSTRACT
Extracytoplasmic function σ factors (ECFs) belong to the most abundant signal transduction mechanisms in bacteria. Among the diverse regulators of ECF activity, class I anti-σ factors are the most important signal transducers in response to internal and external stress conditions. Despite the conserved secondary structure of the class I anti-σ factor domain (ASDI) that binds and inhibits the ECF under noninducing conditions, the binding interface between ECFs and ASDIs is surprisingly variable between the published cocrystal structures. In this work, we provide a comprehensive computational analysis of the ASDI protein family and study the different contact themes between ECFs and ASDIs. To this end, we harness the coevolution of these diverse protein families and predict covarying amino acid residues as likely candidates of an interaction interface. As a result, we find two common binding interfaces linking the first alpha-helix of the ASDI to the DNA-binding region in the σ4 domain of the ECF, and the fourth alpha-helix of the ASDI to the RNA polymerase (RNAP)-binding region of the σ2 domain. The conservation of these two binding interfaces contrasts with the apparent quaternary structure diversity of the ECF/ASDI complexes, partially explaining the high specificity between cognate ECF and ASDI pairs. Furthermore, we suggest that the dual inhibition of RNAP- and DNA-binding interfaces is likely a universal feature of other ECF anti-σ factors, preventing the formation of nonfunctional trimeric complexes between σ/anti-σ factors and RNAP or DNA.
IMPORTANCE In the bacterial world, extracytoplasmic function σ factors (ECFs) are the most widespread family of alternative σ factors, mediating many cellular responses to environmental cues, such as stress. This work uses a computational approach to investigate how these σ factors interact with class I anti-σ factors—the most abundant regulators of ECF activity. By comprehensively classifying the anti-σs into phylogenetic groups and by comparing this phylogeny to the one of the cognate ECFs, the study shows how these protein families have coevolved to maintain their interaction over evolutionary time. These results shed light on the common contact residues that link ECFs and anti-σs in different phylogenetic families and set the basis for the rational design of anti-σs to specifically target certain ECFs. This will help to prevent the cross talk between heterologous ECF/anti-σ pairs, allowing their use as orthogonal regulators for the construction of genetic circuits in synthetic biology.
INTRODUCTION
Extracytoplasmic function σ factors (ECFs) are one the most abundant signal transduction mechanisms in the bacterial kingdom, often mediating the cellular response to external and internal stress conditions. Although these minimalistic members of the σ70 family contain only the σ2 and σ4 domains essential for recruiting RNA polymerase (RNAP) to specific promoter sequences (1), ECFs have evolved into a surprisingly diverse protein family. By now, we know 156 phylogenetic ECF groups (2, 3)—many of which feature group-specific target promoter motifs and conserved regulators of ECF activity, suggesting similar modes of signal transduction within an ECF group. Among these diverse signaling mechanisms, the most common regulators of ECF activity are so-called anti-σ factors, which, under noninducing conditions, sequester ECF into inactive complexes via their anti-σ domain (ASD). Under inducing conditions, anti-σ factors release their ECFs by various mechanisms, including anti-σ proteolysis (4–6), conformational change (7, 8), or sequestration by ECF-mimicking anti-anti-σ factors (9, 10). Given that bacteria harbor an average of 10 ECFs, and some species encode more than 100 ECFs per genome (3), the pertinent question arises how the different ECF/anti-σ factor pairs prevent massive cross talk between each other, or in other words, how do they achieve signaling specificity?
To date, three structurally distinct classes of anti-σ factors, termed classes I to III, have been described in literature (11, 12). Here, we focus on the class I anti-σ factors, which are not only the first characterized (13) but also the most abundant anti-σs known to date (2, 3). Class I anti-σ factors are defined by their N-terminal anti-σ domain I (ASDI), which features a common secondary structure consisting of four alpha-helices: the first three (N-terminal) helices form a bundle that binds to the σ4 domain of the ECF, and separated by a flexible linker, the fourth helix binds to the σ2 domain. Interestingly, while this general theme has been found in all of the four crystal structures of ASDI/ECF complexes solved to date (13–16), these structures also expose a significant diversity in the binding topology between ECFs and ASDIs (Fig. 1). The most striking difference relates to the overall ECF/ASDI conformation (Fig. 1), showing that in three of the cocrystal structures, ChrR/SigERsp (Rhodobacter sphaeroides), RseA/RpoEEco (Escherichia coli), and RskA/SigKMtu (Mycobacterium tuberculosis), the ASDI is sandwiched between the σ2 and σ4 domains, while RsiW wraps around the two σ domains of SigW (Bacillus subtilis). Furthermore, while the three-helix bundle of some ASDIs require zinc coordination for ECF inhibition (ChrRRsp [13]), another structure features a zinc-binding motif but binds the ECF independently of zinc (RsiWB. subtilis [14]), and others do not rely on a zinc-binding motif at all (RskAMtu, RseAEco [15, 16]). Thus, it is tempting to speculate that the divergent binding topologies between ECFs and ASDIs could be important to prevent cross talk between ECF/ASDI pairs of different ECF groups and that these conformations might be conserved within ECF groups. If so, we reasoned that protein sequences of ECF and ASDI proteins have coevolved and that ASDI protein sequences should cluster into phylogenetic groups similar to the ECF groups.
Structures of ECF σ factors in complex with class I anti-σ factors. ECFs are shown in shades of pink, whereas anti-σ factors appear in shades of blue. Different areas of the protein are differentially colored (see legend). Different anti-σ factors show different binding conformations. (A) SigE-ChrR from R. sphaeroides (PDB accession no. 2Q1Z [13]). (B) SigW-RsiW from B. subtilis (PDB accession no. 5WUQ [14]). (C) RpoE-RseA from E. coli (PDB accession no. 1OR7 [16]). (D) SigK-RskA from M. tuberculosis (PDB accession no. 4NQW [15]).
Due to the diversity in ECF/ASDI quaternary structure, we here wondered whether there is a minimal contact interface conserved across all members of the ASDI family. To predict amino acid residues involved in such conserved contact interfaces, we turn to direct coupling analysis (DCA)—a bioinformatic method that exploits evolutionary covariation to predict contacting residues (17). When two residues interact, mutations in one need to be compensated by changes in the second so as to preserve the interaction (17). The same mechanism also applies for indirect contacts; however, DCA is able to distinguish direct from indirect interactions and considers only the former for the calculation of their covariation score (17). One of the highlights of DCA is that aside from stable conformations, it can also provide information on the transient, unstable conformations that occur during the dynamic process of interaction (18).
In this study, we set out to provide the first phylogenetic classification of ASDI proteins and reveal striking patterns of coevolution between these regulators and their cognate ECF σ factors. For the ECF/ASDI interaction, we used DCA to predict the residues that form the core ECF/ASDI contact. The arising sequence logos show divergent use of residues across ASDI groups, thus explaining the low binding affinity of noncognate ECF/ASDI pairs from different groups. However, the predicted interaction partners in the fourth helix of ASDI and their respective counterparts are less conserved even within the ASDI groups. This might explain how ASDI proteins maintain binding specificity even within ASDI groups. These results allow a first, in silico assessment of potential cross talk between two ECF/ASDI pairs without expensive in vivo testing, opening new ways to rationally design synthetic circuits using orthogonal ECF/ASDI pairs.
RESULTS
ASDI retrieval and classification.We focused on the class I anti-σ factors (ASDIs) as the main regulators of ECF σ factors, in order to gain a better understanding of their general binding mechanism for ECFs. Given that anti-σ factors are often coencoded with their ECF targets (1, 2, 13, 19, 20), we first set out to collect ASDIs from the genetic neighborhood of the ECF coding sequences. To this end, we focused on a set of 21,047 putative anti-σ factors identified during a recent classification effort for ECF σ factors by our group (3). To identify ASDI-containing proteins from this data set, we used hidden Markov models (HMMs) developed from a small data set of both zinc-binding and non-zinc-binding ASDIs published earlier by Staroń and colleagues (2) (see Materials and Methods for details). This step yielded 7,490 proteins, showing that ∼36% of all putative anti-σ factors are ASDIs. In order to further expand the size of the ASDI sequence library, we built a new extended HMM from the ASDI domain of these sequences. We used this extended model to search for ASDIs in the genetic neighborhood of all classified ECFs identified in reference 3, using only the 33,843 ECFs from representative and reference organisms as labeled by the National Center for Biotechnology Information (NCBI). This yielded 11,939 proteins, from which we removed the ones with ASDIs shorter than 50 amino acids, since these could be divergent class II anti-σ factors (21). The final number of ASDIs retrieved by this pipeline was 10,930, of which 10,806 have a nonredundant anti-σ domain. This shows that, on average, about one-third (∼32%) of the ECF σ factors contain a protein with an ASDI domain in their genetic neighborhood, suggesting that ASDIs are the most widespread regulators of ECF activity known to date. The average size of the ASDI domain was 101 ± 33 (standard deviation) amino acids.
To gain an idea about the evolutionary history of these regulators, we classified them according to sequence similarity of their ASDI domain into 1,475 clusters of closely related sequences. For that, we used a divisive strategy, where the pool of sequences was subjected to a bisecting K-means clustering algorithm until the maximum k-tuple distance among sequences in the cluster was smaller than an empirical threshold of 0.6 (see Materials and Methods). Then, the consensus sequences of these clusters (referred to as subgroups here) were hierarchically clustered into a phylogenetic tree (Fig. 2). A simple inspection of the ASDI tree shows that neighboring ASDI sequences on the tree also regulate σ factors from the same ECF group (Fig. 2, ring 2), supporting the notion that ECFs and ASDIs coevolved. Given the similarity between ECF and ASDI classifications, we split the ASDI tree into monophyletic groups that regulate σ factors from the same ECF group (Fig. 2, ring 1). This split usually agrees with high bootstrap values (Fig. 2; see also Fig. S1 in the supplemental material), suggesting that this definition of ASDI groups is robust to changes in the data set. As a result, ASDI groups were named with “AS” followed by a number dependent on the ECF group with which they are found in genomic proximity. Even though ASDIs from the same clade of the ASDI tree are usually coencoded with (and thus likely regulate) members of the same ECF group, some ASDI groups have slightly divergent features and are located in different clades of the ASDI tree. Two of these ASDI groups are, for example, AS19-1 and AS19-2, which regulate members of ECF19 (Fig. 2), but they are divergent in their ASDI helix 1 (consensus motif HTLAGAYALDAL in AS19-1 versus HLDPDQLALLA in AS19-2) and helix 2 (consensus motif LDDERAAFERHL in AS19-1 versus GEPLDADERAHL in AS19-2). Given that group AS19-2 is more closely related to AS27 than AS19-1, this suggests that these groups may have independently evolved the ability to bind to ECFs of group 19.
ASDI phylogenetic tree. Phylogenetic tree of the consensus sequences of subgroups of class I anti-σ factor domains. The tree is rooted at the sequence of the class II anti-σ factor CnrY, from Cupriavidus metallidurans, used as an outlier. Branch length indicates evolutionary distance. Internal branch colors indicate bootstrap values, where 0% is red and 100% is green. Rings are explained as follows: 1, ASDI group defined in this work; 2, ECF group of the cognate ECFs encoded in the same genetic neighborhoods; 3, presence of Zn-binding motif; and 4, average domain composition of the anti-σ factors associated with each subgroup. The most important domains are explained in the legend.
FIG S1
Copyright © 2020 Casas-Pastor et al.This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
ASDIs that regulate ECFs from the same subgroup are usually located together in the tree but split into distinct ASDI subgroups (data not shown), probably due to the larger sequence diversity of anti-σ factors than of ECFs. We observed that, even though there was a mixture of zinc-binding and non-zinc-binding ASDIs in the input data set (as indicated by the presence or absence of an “Hx3Cx2C” motif), both types distribute across the ASDI tree, generating ASDI groups that are mixtures of zinc- and non-zinc-binding proteins, such as AS19-1 and AS27 (Fig. 2). Exceptions are groups AS33-1 and AS33-2, whose difference is the presence or absence of the zinc-binding domain, respectively (Fig. 2, ring 3).
Additionally, we predicted the mode number of transmembrane helices (TMHs) in the different ASDI subgroups using the consensus prediction from online TopCons (22). Most of the full-length anti-σ factor sequences (∼65%) are predicted to contain at least one TMH, suggesting that they are bound to the membrane, while the remaining ones (35%) are likely soluble anti-σ factors. Although the whole data set of ASDIs is composed of similar amounts of zinc- (∼56%) and non-zinc-binding (∼44%) proteins, we observed that among the soluble ASDIs there was an overrepresentation (∼72%) of sequences with a zinc-binding motif. This is consistent with the notion that cytoplasmic ASDIs are often involved in sensing intracellular redox conditions (13, 23, 24). The membrane-bound anti-σ factors contained ∼48% of sequences with a zinc-binding motif, contrasting with earlier observations that membrane-bound anti-σ factors showed an underrepresentation of zinc-binding domains (13). However, the data in this earlier work were based on a much smaller sequence data set of only 1,266 sequences (13), suggesting that this apparent bias may have been due to random sampling of the sequenced genomes at the time. Our finding of an approximately equal distribution of zinc- and non-zinc-binding motifs in the membrane-bound ASDIs indicates that the Zn-binding motif could be playing a nonsensing role, e.g., by taking a more static, structural function as is the case for RsiW in B. subtilis (14).
If the Zn-binding motif does not play an active sensory role, the general notion is that the ASDI domains have associated with additional protein domains that allow stimulus perception and ultimately trigger anti-σ factor release (25, 26). To assess the conservation of additional protein domains, usually located C terminal of the ASDI domain, we scanned full-length class I anti-σ factors with Pfam 31.0 models (27) as well as the extended model of the ASDI domain. When indicating the positions of these domains in the different class I anti-σ factor subgroups (Fig. 2, ring 4), we found that the protein domains associated with ASDIs are typically well conserved for ASDIs from the same group but differ between groups. This suggests that ASDIs regulating members of the same ECF group are likely sensing similar input cues, by binding directly either to the triggering molecule or to other sensory proteins. The full list of ASDIs, together with their partner ECF and their ASDI group and subgroup, can be found in Table S1.
TABLE S1
Copyright © 2020 Casas-Pastor et al.This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
Given the ample degree of correlation between ECF and ASDI classifications, we evaluated whether these families coevolved. For this, we calculated the Pearson correlation coefficient (PCC) of the pairwise distance matrices of ASDIs and ECFs, as described by Goh et al. (28), leading to a PCC of 0.82. In order to determine the significance of this correlation coefficient, we adopted the strategy of Dintner et al. (29) and included as negative controls RsbW-like anti-σ factors and RpoD-like σ factors, which do not interact with ECFs and ASDIs, respectively: RsbW is the anti-σ factor of the alternative σ factor σB and a protein kinase of the anti-anti-σ factor RsbV in Bacillus subtilis (30), while RpoD is the housekeeping σ factor of Escherichia coli (31). For these negative controls, we obtained low PCCs (0.5 to 0.6), which are similar to the ones obtained by Dintner et al. for negative controls in bacterial two-component systems (29). This indicates that the PCC of 0.82 obtained for the correlation between ECFs and ASDIs is highly significant, showing that there has been strong coevolution between these protein families (Table 1).
Pearson’s correlation coefficient of the distances for ECF and ASDI pairs in organisms that contain RsbW-like and RpoD-like proteins, used as negative controls for lack of correlation
A taxonomic analysis of the ASDI protein family (Fig. S2A) further shows that ASDI groups are often composed of sequences originating from a single bacterial phylum, e.g., for AS02 (Proteobacteria), AS245 (Firmicutes), or AS12 (Actinobacteria), consistent with the observation that some ECF groups are phylum specific (3). While most other ASDI groups are also typically composed of sequences from a dominant phylum, they often contain a few ASDI subgroups from other phyla, e.g., for AS11, AS26, or AS243, suggesting that these subgroups may have resulted from horizontal gene transfer. Another interesting observation is that while all ASDIs from AS12 are found in Actinobacteria, the regulated ECFs make up only 89% of the sequences in the ECF12 group (3). The other ECF12s are found mostly in Bacteroidetes (∼7%) and Proteobacteria (∼2%), and the fact that they do not feature an ASDI in their genomic neighborhood indicates that they either are regulated by orphan ASDIs or have adopted another mode of regulating ECF activity (3). In fact, when more closely examining the phylum-specific frequency of ASDIs in the genomic neighborhood of ECFs (Fig. S2B), we found that in Bacteroidetes only 6% of ECFs are associated with ASDIs, which is significantly lower than the average of 32% found across all phyla. This is consistent with the observation by Staroń et al. (2), who noted that the Bacteroidetes group of RpoE-like proteins (ECF03) also lacks a conserved anti-σ factor in their genomic context, suggesting again either that gene synteny is broken or that other modes of ECF regulation, e.g., via other anti-σ classes, are dominant in this phylum. Other phyla, in contrast, feature a strong overabundance of ASDI-associated ECFs, such as the Gemmatimonadetes (∼79%) or the Chloroflexi (72%), but further studies are needed to identify the origin of this taxonomic bias.
FIG S2
Copyright © 2020 Casas-Pastor et al.This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
DCA predicts two main contact interfaces between ASDIs and ECFs.Given the variability in the binding conformations in the four published ECF/ASDI cocrystal structures, we next wondered whether there exist universally conserved “core-binding interfaces” that are shared within the whole family of ASDI proteins, or whether the strong coevolution between the protein families gave rise to fundamentally different binding conformations. To identify potentially conserved contact interfaces, we sought to exploit the coevolutionary information between our ASDI data set (above) and the ECF classification (3). Specifically, we aimed at predicting amino acid residues on ASDIs and ECFs that display significant covariation, suggesting that they are in direct contact and that the mutation in one residue is balanced by a compensatory mutation in its binding partner. To this end, we applied direct coupling analysis (DCA) (17) to the full set of ASDIs and their cognate ECFs (Table S1). The results of this analysis revealed a large amount of high DCA scores within the σ2 and σ4 domains of the ECF σ factor and also connecting the two σ domains (Fig. 3A). This pattern matches previous DCA results in ECF σ factors (32) and is indicative of the conserved secondary and tertiary structure on this family of proteins. We also observed high scores interconnecting helices 1, 2, and 3 of the ASDI, while helix 4 shows no strong DCA coupling scores with other parts of the ASDI domain (Fig. 3A). This agrees with the cocrystal structures of ECF/ASDI complexes (Fig. 1), where helices 1, 2, and 3 form a helix bundle, which is connected to helix 4 by a flexible linker (13–16). We then focused on the predictions that link ECFs and ASDIs since these are the ones responsible for ECF inhibition. At first glance, the contact map shows several high DCA scores linking the fourth helix of the ASDI with the σ2 domain (Fig. 3A). Under closer inspection, the top 14 interprotein contact predictions (DCA score ≥0.255) are located in close proximity in most of the crystal structures (Fig. 3B and C). Of those, 12 are connecting the σ2 domain and helix 4 of the ASDI, and two (DCA#10 and #11) connect a single residue of helix 1 of the ASDI to two residues located in the σ4 domain of the ECF (Fig. 3E). In the first case, the predicted contact area includes ECF regions 2.1 and 2.2 (Fig. 3D), whose main function is binding to the clamp helices of the β′ subunit of the RNAP (33–35). Thus, it is likely that binding of ASDI’s helix 4 to this area prevents ECF binding to the RNAP core, hampering ECF-dependent transcription when the anti-σ factor is present. Instead, DCA predictions #10 and #11 involve ECF helices 4.2 and 4.4 (Fig. 3D), in two residues involved in the contact with the −35 element of the promoter (33, 36). Taken together, the presence of these strong coevolutionary signals suggests that the majority of the 10,860 ASDI proteins establish contact to the ECF via these two binding interfaces, connecting the ASDI with both the σ2 and σ4 domains.
DCA results on the contact between ECFs and ASDIs. (A) DCA contact map. Each axis represents the concatenated protein sequences of RpoE and RseA, from E. coli, used as reference for the amino acid labeling. High DCA scores, indicated by darker colors, correspond to residues with a high likelihood to bind in vivo. The 14 highest scores (DCA score ≥0.255) are marked in the heatmap and labeled according to their rank. (B) Table of the 14 highest-scoring DCA predictions, mapped to the amino acid coordinates of RpoE and RseA from E. coli. The common contact (CC) column indicates the DCA predictions that are also common contacts observed in the four crystal structures of ECFs/ASDIs, as derived by Voronoi tessellation (Table 2). (C) Scatterplot of the top 21 DCA predictions against the distance between the alpha carbons of the predicted contacts, as derived from the four structures of ECF/ASDI complexes (Fig. 1). The top 14 predictions are in close proximity in most of the three-dimensional structures. Complexes are labeled after their anti-σ factor, where RseA corresponds to RpoE/RseA complex from E. coli (PDB accession no. 1OR7 [16]), ChrR to SigE/ChrR from R. sphaeroides (PDB accession no. 2Q1Z [13]), RsiW to SigW/RsiW from B. subtilis (PDB accession no. 5WUQ [14]), and RskA to SigK/RskA from M. tuberculosis (PDB accession no. 4NQW [15]). (D) Multiple-sequence alignment of two selected ECF/ASDI pairs, RpoE/RseA from E. coli and SigK/RskA from M. tuberculosis. Labels of the top 14 contacts indicate their position. The presence of alpha-helices and their names are depicted on top of the alignment. The sequence logo depicts the amino acid composition of the full ECF and ASDI alignments derived from 10,930 sequences, respectively. (E) Three-dimensional depiction of the top 14 predictions in the structure of RpoE/RseA complex (PDB accession no. 1OR7 [16]). ECF is colored in beige, and anti-σ factor is in gray. Predicted contacts are labeled according to their rank. N and C termini from ECF and anti-σ factor are labeled.
However, although the top 14 DCA predictions connect residues located in close 3-dimensional proximity in most of the four resolved cocrystal structures of ECFs/ASDIs, only six are direct contacts in the four crystal structures (Fig. 3B). While the other 8 “close hits” could merely be false-positive predictions, it is tempting to speculate that these residues might form close contacts in other ECF/ASDI groups, which might take slightly different binding conformations from those captured by the four structures solved to date. Alternatively, these close hits may form transient contacts during the initial recognition between ASDI and ECF. Another observation was that 19 direct contacts that are shared between the four ECF/ASDI cocrystal structures were not predicted by DCA (Table 2), suggesting either that DCA fails to predict them or that these contacts are less prevalent in the remainder of the ECF/ASDI protein families.
Common contacts between ECFs and ASDIs found in the four cocrystal structures by Voronoi tessellationa
To obtain a better overview of the residues involved in the contact interfaces, we plotted the residues predicted by DCA—both in the ECF and in the ASDI—for the 12 largest ASDI groups with more than 100 sequences (Fig. 4). The resulting logos showed that contacts involving ASDI’s helix 1 and the σ4 domain (DCA#10 and #11) are generally conserved within groups but different between groups. Predictions DCA#10 and #11 feature two main types of contacts, either a charged or a hydrophobic interaction (Fig. 4). This pattern is most evident for prediction DCA#11, which tends to harbor a positive amino acid in the ECF (e.g., R178 in RpoEE.coli) and a negative residue in the ASDI (e.g., D11 in RseAE.coli), as found in groups ECF02, ECF12, ECF14, ECF27, ECF235, and ECF245. However, in some cases this is replaced by a hydrophobic contact, typically with leucine on both the ECF and ASDI (e.g., L177 in SigK and L18 in RskA from Mycobacterium tuberculosis), as found in groups ECF17, ECF18, and ECF19. In contrast to these clear-cut contact motifs predicted for helix 1, residues in helix 4 of the ASDI (all predictions except DCA#10 and #11) exhibit a weaker conservation even within most of the ASDI groups (Fig. 4). This has some exceptions, such as the prediction DCA#7, featuring a conserved contact between an aromatic residue (W or Y) on the ASDI and a proline (P) on the ECF side in groups ECF12, ECF14, ECF27, and ECF245 (Fig. 4). Together with the observation that helix 4 of the ASDI holds most of the DCA predictions (Fig. 3D), this suggests that helix 4 is in charge of further determining the specificity of the ASDIs, keeping them orthogonal from other ASDIs of the same group. Indeed, anti-σ factors that regulate ECFs from the same group have been found to be mostly orthogonal (37).
Sequence logos of the top 14 DCA predictions, computed for the 12 ASDI groups with more than 100 sequences. The sequence logos show the amino acid composition for the DCA-predicted contact points for both the ECF and anti-σ factor in each ECF/ASDI group. The contacts are ordered from left to right according to their DCA rank, as indicated on top. The sequence logos are manually arranged based on their similarity.
Specificity-determining positions of ASDI groups coincide with the predicted binding interfaces.Next, we asked whether the ASDI residues predicted to be in contact with the ECF are also key residues that determine the distinction between ASDI groups. If this was the case, it would suggest that ASDI groups would be primarily distinguished by their interaction with their respective ECF. Alternatively, if ASDI groups were primarily determined by residues outside predicted contact interfaces, this would argue that interactions with potential ligands or intraprotein interactions determine protein subfamilies (38). The presence of such group-specific amino residues—so-called specificity determining positions (SDPs)—can be detected by S3det, a bioinformatic tool based in multiple correspondence analysis that finds residues associated with subfamilies of proteins (39). Using this tool, we predicted SDPs by comparing every pair of the 12 largest ASDI groups and taking only the highest-scoring SDP prediction of every ASDI group into further consideration (see Materials and Methods). As a result, we identified five SDPs, named by running numbers (SDP#1 to SDP#5) from N to C terminus: two in helix 1, one in helix 3, one in helix 4, and the last one exclusively present in group AS243 (Fig. 5A). Proteins from group AS26 did not hold any prediction, since they do not fit well into the multiple sequence alignment of the full ASDI data set—probably due to extensive differences at the sequence level (cf. Fig. 4). Similarly, AS243’s SDP#5 corresponds almost exclusively to a gapped position in the alignment with the rest of the groups, as indicated by the absence (or very narrow representation in the case of AS245) of conserved residues for SPD#5 (Fig. 5B). These differences at the sequence level might reflect functional differences between standard ASDIs and ASDIs from groups 243 and 26. In favor of this hypothesis, one member of AS243, FecR from E. coli, is distinguished from ASDIs of other (non-AS243) groups in that its 59 N-terminal amino acids are essential for ECF activity (40). Probably due to these unique features, the observed misalignment between members of AS243 and the other ASDI groups precludes further interpretation of the SDPs in this group. In contrast, all other predicted SDPs (except SDP#5) are part of the contact interfaces with the ECF in the existing crystal structures (Fig. 5C). Conserved position D11 in RseAE.coli, predicted by DCA (Fig. 3B, DCA#10 and #11), was part of the predicted SDPs (Fig. 5A, SDP#2). Yet another SDP, S56 in helix 4 (Fig. 5A, SDP#4), was predicted by DCA (Fig. 3B, DCA#1 and #5). Predictions SDP#1 and SDP#3 connect S7 in helix 1 and Y36 in helix 3 in RseAE.coli to the σ4 domain, usually in its last helix (Fig. 5C). Interestingly, SDPs #1, #2, and #3 form a cluster of interactions with the same area of the ECF, which usually corresponds to the last helix of the σ4 domain, except in the SigE/ChrR structure, where the contact appears before this area (Fig. 5C). Thus, besides some exceptions in groups AS26 and AS243, these results suggest that the main characteristic that discriminates between ASDI groups is their ability to interact with the σ factors within their cognate ECF groups.
Description of the specificity determining positions (SDPs) that distinguish different ASDI groups. (A) Multiple-sequence alignment of the anti-σ factors RseA from E. coli and RskA from M. tuberculosis showing the position of the SDPs, labeled with numbers according to sequence position. Alpha-helices and their names are indicated with red boxes on the ASDI sequences. The sequence logo shows the amino acid composition of the full ASDI alignment. (B) Logo of SDPs in every ASDI group with more than 100 proteins. Positions are labeled as in panel A. (C) ASDI specificity determining positions plotted in the structure of ECF/ASDI complexes. ECFs are colored in beige, and anti-σ factors are in gray; SDPs are colored in green and labeled with their identifier as in panel A. The RpoE/RseA complex is present in E. coli (PDB accession no. 1OR7 [16]), the SigK/RskA complex is in M. tuberculosis (Mtu, PDB accession no. 4NQW [15]), SigW/RsiW is in B. subtilis (Bsu, PDB accession no. 5WUQ [14]), and SigE/ChrR is in R. sphaeroides (Rsp, PDB accession no. 2Q1Z [13]). Contacts with the ECF are represented by connector lines.
Given that these residues are conserved within phylogenetic ASDI groups, face the ECF in the solved ECF/ASDI crystal structures, and feature different amino acids in different groups, it is likely that they take part in determining specificity toward the target ECF. This is supported by that the fact that most of these SDPs are also DCA predictions (Table 2).
DISCUSSION
In this study, we used a computational approach to study how the class I anti-σ factor family members interact with their cognate ECF σ factors. Based on the similarity between ECF and ASDI phylogenies, we showed that these protein families have coevolved—likely because they are in direct contact with each other—and exploited this coevolution to predict two conserved binding interfaces for the ASDI/ECF interaction. Although previous work provided insight into the cocrystal structures of individual ASDI/ECF pairs, the present work puts these case studies into a broader, evolutionary perspective, by providing the first phylogenetic classification of the class I anti-σ factor protein family. Interestingly, within the resulting AS groups—solely defined by the sequence of their ASDI domain—we observed a striking conservation of the fused protein domains. Compared to early work by Campbell et al. (13), the explosion in sequenced genomes in recent years allowed us to expand the ASDI data set from 1,266 to more than 10,000 putative ECF/ASDI pairs from NCBI reference genomes, providing a more comprehensive and phylogenetically balanced overview of the diversity of these proteins. In agreement with the work of Campbell et al. (13), we found that about one-third (∼32%) of all ECFs are genomically associated with, and thus likely regulated by, ASDIs. Yet, our expanded ASDI library showed important differences compared to previous work in that (i) we find more ASDIs containing a zinc-binding motif (∼56% compared to ∼38% [13]); (ii) we find more cytoplasmic anti-σ factors (∼35% compared to ∼28% [13]); (iii) cytoplasmic anti-σ factors are still overrepresented in zinc-binding motifs, but to a smaller extent (∼72% of the soluble anti-σ factors are zinc binding in our data set compared to 92% in reference 13); and (iv) membrane-bound ASDIs are not underrepresented in zinc-binding motifs as suggested in reference 13, with about half of the proteins (∼48%) being zinc-binding anti-σ factors. These data suggest that ASDIs are more diverse than previously thought and argue against a functional role of the zinc-binding domain exclusively in soluble anti-σ factors. This is supported by the ASDI phylogenetic tree (Fig. 2), where zinc- and non-zinc-binding ASDI groups are mixed across the tree and sometimes even within the same group, as in the case of AS27 and AS19-1. In these mixed zinc- and non-zinc-binding groups, this suggests that the zinc-binding motif may play a structural instead of a sensory role, as shown for RsiW from B. subtilis (group AS245) (14).
Our analysis of DCA predictions and SDPs show that there exists a conserved, dual binding interface, with ASDI’s helix 1 binding to the σ4 domain and ASDI’s helix 4 binding to the σ2 domain. These results agree with crystal structures of ECF/ASDI complexes (13–16) and suggest that the contacts seen in these few examples are indeed realized across the full ECF/ASDI families. Further, our results suggest that ASDI’s helix 2 is not critical for ECF binding but is important for ASDI tertiary structure. ASDI's helix 3, which is located between ECF’s σ2 and σ4 domains in three out of four structures (13, 15, 16), harbors an SDP involved in the interaction with σ4 domain, in similar residues as those contacted by the prediction on helix 1. This modularity of the ASDI interaction is reflected in the function of the ECF residues involved in the predictions. Contacted residues in regions 2.1 and 2.2 are mostly involved in the contact with the clamp helices of the β′ subunit of the RNAP (33, 35), whereas predicted contacts in σ4 are part of the contact interface with the −35 element of the promoter (33, 36).
The analysis of the DCA predictions revealed a different degree of conservation across ASDI groups, with the residues that take part in contacts between ASDI’s helix 1 and ECF’s σ4 (DCA predictions #10 and #11) being conserved for most of the ECF and ASDI phylogenetic groups. Interestingly, this area, which connects D11 on the ASDI to R149 and R178 on the ECF (RseA/RpoEE.coli coordinates), bears two main types of interactions, that is, hydrophobic, which usually features leucine in both ECF and ASDI (Fig. 4, groups AS17, AS18, and AS19-1), or charged, usually featuring arginine in the ECF side and aspartate in the ASDI side (Fig. 4, groups AS02, AS12, and AS14, among others). Random mutagenesis in RseAE.coli (group AS02) showed that a single amino acid mutation of D11 to histidine completely inhibits RseAE.coli activity (41), confirming the key role of this contact. Given their group-specific conservation and the striking polarity differences between the two binding types, we speculate that D11 defines coarse-grained specificity of ASDIs for ECFs of the same binding type, usually found in the same phylogenetic group. However, ASDIs are usually specific to their own target ECF and do not usually cross talk with members of the same group (37), indicating that there are more sources of specificity in residues that are not conserved in groups. One potential source of this specificity is the residues predicted by DCA in helix 4. These residues are generally not conserved within groups (Fig. 4) and bind the σ2 domain in all the solved crystal structures of ASDI/ECF complexes (13–16). This lack of major conservation is extended to the predicted contacts on the ECF side, which are generally in charge of binding to the β′ subunit of the RNAP.
Generality of the dual binding interface in other σ/anti-σ interactions?Paget classified anti-σ factors into two types, the ones that insert between σ2 and σ4 (RseA, RskA, and ChrR) and the ones that wrap around these domains (RsiW) (42). Our data show that despite these differences in binding topology, both types of ASDIs contact the two main binding interfaces described here. Moreover, a similar binding mode can be observed in the crystal structures of the ECF CnrH in complex with the class II anti-σ factor CnrY, from Cupriavidus metallidurans (43). The two alpha-helices of CnrY wrap around CnrH in a conformation where CnrY’s first alpha-helix mimics the function of ASDI’s first helix and binds to the σ4 domain, and CnrY’s second and last alpha-helix binds to the σ2 domain in a similar manner as ASDI’s fourth helix. The only crystal structure of a member of the ASDIII class of anti-σ factors, BldN, in complex with the ECF σ factor RsbN from Streptomyces venezuelae (12) also shows this dual binding mode. In this case, the first and second alpha-helices of BldN bind to the σ4 domain, whereas its third and last alpha-helix binds to the regions 2.1 and 2.2 of a different RsbN molecule, similarly to ASDI’s fourth helix (12). The similarity of the binding between the three types of ECF anti-σ factors is striking and contrasts with their low level of sequence similarity, which is limited to ∼11% for RseA/BldN and ∼3% for RseA/CnrY (using global pairwise alignments calculated by the Needleman-Wunsch algorithm implemented at EBI [44]). This explains why, even though the same regions of the anti-σ factor interact with a similar area of the ECF in the three types of ECF anti-σ factors, the specific residues that carry out the interaction with the ECF may differ between ASD types. It is unclear why bacteria need at least three types of ASDs. On one hand, different ASDs may provide extra specificity to ECF inhibition, which could help to reduce the apparent tendency to cross talk of anti-σ factors (45). On the other hand, the three types of ASDs could have emerged from different proteins and optimized their ECF inhibition by blocking the same ECF regions through convergent evolution. Future analysis that includes all the ASDs known to date could help in understanding their evolution.
Interestingly, dual binding interfaces between σ and anti-σ factor extend beyond ECF σ factors. For instance, in E. coli the anti-σ factor FliM of the class 3 σ factor FliA (containing a σ3 domain) also targets σ2 and σ4 regions with two different areas of the protein (46). However, the FliM inhibitory contacts are inverted relative to ECF anti-σ factors: FliM is composed of four alpha-helices, of which the first and second bind to the surface of the σ2 domain, similarly to the fourth helix of ASDIs. In FliM, the third and fourth helices are the ones that bind to σ4 (46), similarly to the first helix of ASDIs. Interestingly, FliM does not bind to FliA’s σ3 domain, strengthening the idea that the blockage of both σ2 and σ4 is the core of σ factor inhibition. Whether this is also the case for housekeeping σs and their anti-σs remains to be seen, as to date only the interaction between the anti-σ factor Rsd and a truncated version of RpoD (containing only the σ4 domain) was studied in E. coli (47, 48). Thus, even though the present analysis was restricted to the interaction between ASDIs and ECFs, we suggest that the dual inhibition of RNAP- and DNA-binding interfaces is likely a universal feature of other anti-σ factors, preventing formation of nonfunctional trimeric complexes between σ/anti-σ factors and RNAP or DNA.
MATERIALS AND METHODS
General bioinformatic tools.Generally, multiple sequence alignments (MSAs) were generated by Clustal Omega 1.2.3. with options –iter = 2 and –max-guidetree-iterations = 1 and manually curated (49). However, UPP (50) (default options) was used for alignments subjected to DCA or to S3det, since they require stable columns of equivalent residues with few gaps. Hidden Markov models (HMMs) were built using hmmbuild function and used for scanning libraries using hmmscan function, both from HMMER suite 3.1b2 (51) and both with default parameters. For the extraction of the amino acid residue interactions between ECF and ASDI from cocrystal structures, we used Voronoi tessellation as implemented in Voronota version 1.19 (52). Protein structures were visualized using UCSF Chimera version 1.10.2 (53).
ASDI extraction.ASDIs were extracted from the genetic neighborhood (±10 coding sequences) of a library of 46,293 ECF σ factors in their most recent classification (3). In order to minimize taxonomic bias, these ECFs were extracted from organisms tagged as representative or reference species by NCBI (https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/), using only RefSeq entries when both RefSeq and GenBank records are available for the same genome. To identify ASDI domain-containing proteins, we first used two HMMs, one built from the zinc-binding and another from the non-zinc-binding anti-σ factors from the work of Staroń et al. (2). We selected the optimal bit score threshold for the retrieval of new ASDIs for each HMM by optimizing a receiver operator characteristic (ROC) curve using the function roc_courve from sklearn.metrics (54). Proteins that were used for the construction of each model were used as positive controls, and the remaining, non-ASDI anti-σ factors from the work of Staroń et al. (2) were used as negative controls. The resulting bit score thresholds, 0.4 for non-zinc-binding and 14.2 for zinc-binding models, were applied for the extraction of ASDIs from the set of putative anti-σ factors from reference 3. This resulted in 7,490 ASDIs, which were subsequently used for the construction of an extended HMM of the ASDI family. The thresholding bit score that best separates real ASDIs from other proteins was optimized using a ROC curve as described above, resulting in a bit score threshold of 0.2. We used the extended HMM to look for further members of the ASDI family in the genetic neighborhood of ECFs (±10 coding sequences) from reference 3. In order to lessen the bias toward frequently sequenced organisms, we included only proteins from representative or reference genomes as labeled by NCBI (https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/), using only RefSeq entries when both RefSeq and GenBank records are available for the same genome. This yielded 11,939 putative ASDI-containing proteins. We further curated these data, removing proteins with anti-σ domains shorter than 50 amino acids, since these could be anti-σ factors of class II (21). The area of the ASDI was defined as the envelope region of the highest-scoring hit of the extended HMM, discarding areas that are part of the transmembrane helices or extracellular. This resulted in 10,930 ASDIs, with an average length of 101 ± 33 (standard deviation) amino acids.
Clustering of ASDIs.We clustered ASDIs according to amino acid sequence similarity. Given the large number of proteins, we first grouped them into clusters or closely related sequences, the so-called subgroups. These were built with a divisive strategy, where proteins were subjected to a bisecting K-means clustering approach until the maximum k-tuple distance between any protein of the cluster was smaller than 0.6, as measured by Clustal Omega with –distmat-out –full and –full-iter flags (49, 55). Bisecting K-means was implemented using KMeans function from the sklearn.cluster module (54). The 3,790 proteins that did not enter into any subgroup were left ungrouped. Thanks to this grouping, it was easier to see subgroups that may contain outliers that passed the HMM threshold but do not likely display anti-σ factor activity. In order to distinguish and discard these outliers from our clustering, we assessed the presence of Pfam domains (Pfam 31.0 [27]) in the anti-σ factors from each subgroup. We discarded 132 subgroups (606 proteins) where the Pfam domains indicated an unlikely anti-σ factor function (data not shown). In summary, the resulting 1,475 subgroups defined during this process contained 6,534 proteins (∼60% of the starting ASDIs), with a median group size of 3 proteins and a standard deviation of 6.17 proteins. Given the low size of proteins in each subgroup, we further clustered the manually curated alignment of the consensus sequences of each subgroup into a maximum-likelihood phylogenetic tree using IQ-TREE version 1.5.5 (56) with 1,000 ultrafast bootstraps. As an outgroup of this tree, we included the anti-σ factor class II CnrY, from Cupriavidus metallidurans. The resulting tree was visualized in iTOL (57) and split into monophyletic ASDI groups according to the ECF group of their cognate partner. With this strategy, we defined 23 ASDI groups, of which 12 contain more than 100 proteins.
The presence of a zinc-binding domain was assumed in ASDIs with a Hx3Cx2C sequence signature that expands over helix 2 and helix 3. Presence of transmembrane helices was assessed using the consensus prediction from online TopCons (22). The mode number of transmembrane helices was considered in order to plot the transmembrane helices for a whole subgroup of class I anti-σ factors. In this way we avoid biases caused by the extremely large number of transmembrane helices in long, divergent proteins. The position of these helices for plotting was calculated according to the average start and end positions over the anti-σ factors in a subgroup. Similarly, the position of the ASDI domain across anti-σ factors from the same subgroup was calculated according to the average start and end positions of the envelope region of the lowest E value match to the extended HMM of the ASDI family, using hmmscan for HMMER suite 3.1b2 (51). The presence of other Pfam domains in full-length class I anti-σ factors was evaluated using hmmscan function from HMMER suite 3.1b2 (51) with the library of HMMs from Pfam 31.0 (27). Pfam domains present in certain position of the MSA of the full-length anti-σ factors in more than 50% of the members of a subgroup were plotted in the ASDI tree.
ASDI/ECF coevolution.In order to evaluate the coevolution of ECFs and ASDIs, we calculated the Pearson correlation coefficient (PCC) of the distances between cognate pairs of proteins, as introduced by Goh et al. (28). The significance of this PCC was evaluated similarly to reference 29. For this purpose, the PCCs between ASDIs and ECFs and of two extra families of proteins that did not coevolve and/or interact with ECFs or ASDIs were evaluated as negative controls. In our case, these negative controls were homologs of the E. coli housekeeping σ factor σ70 (RefSeq accession no. NP_417539.1) and of the Bacillus subtilis anti-σ factor RsbW (RefSeq accession no. WP_061902497), since proteins for these types have never been described to interact with ASDIs or ECFs, respectively. We extracted proteins from these types using online HMMER (51) with parameters -E 1 –domE 1 –incE 0.01 –incdomE 0.03 –mx BLOSUM62 –pextend 0.4 –popen 0.02 –seqdb uniprotrefprot and mapped the hit identifiers (IDs) from UniProt to GenBank using the UniProt’s ID conversion tool (58). A total of 409 genomes contained the four protein families; these are ECFs, ASDIs, RsbW, and RpoD. For each organism, we selected one of the ECF-AS factor pairs and one homolog of RsbW and RpoD. These proteins had a taxonomically diverse origin, with 39% of the proteins from Firmicutes, 28% from Actinobacteria, 11% from Cyanobacteria, and the rest from eight other bacterial phyla. We calculated the pairwise distance for each protein family using Clustal Omega with –full and –distmat-out flags (49). The PCC was calculated from the flattened distance matrices using pearsonr function from Python’s scipy.stats resource (59).
DCA.Direct coupling analysis (DCA) was applied to the 10,930 putative ASDIs extracted during this work (see Table S1 in the supplemental material). ASDIs and their cognate ECF partners were aligned independently using UPP (50) with default parameters, and the resulting alignments were concatenated. Gaussian DCA with default parameters (60) was performed on this alignment (N = 275, M = 10,934, Meff = 965.52, theta = 0.46). The top DCA predictions were mapped into the crystal structures of RpoE/RseA from Escherichia coli (AS02, PDB accession no. 1OR7 [16]), SigE/ChrR from Rhodobacter sphaeroides (AS11, PDB accession no. 2Q1Z [13]), SigK/RskA from Mycobacterium tuberculosis (AS19-1, PDB accession no. 4NQW [15]), and SigW/RsiW from Bacillus subtilis (AS245, PDB accession no. 5WUQ [14]). Distances between predictions were calculated using the Bio.PDB module (61, 62) and Chimera (53). The 14 predictions that connected residues in close proximity (<15 Å) in most of the structures were considered true interactions.
SDPs.Specificity determining positions (SDPs) were calculated with S3det (39) on the 12 ASDI groups with more than 100 proteins and on their cognate ECFs. Aligned ASDI (or ECF) proteins were extracted from the MSA used for DCA so as to preserve the same positional mapping. S3det was executed on every pair of ASDI (or ECF) groups, resulting in a set of ranked SDP predictions for every pair of groups. We scored the SDPs associated with every group as the sum of the inverse of their ranks across the different S3det runs with contribution of the group. The highest-scoring SDP for every group was considered positive, resulting in five SDPs.
ACKNOWLEDGMENTS
This work was supported by a grant from the ERA-SynBio program via the Federal Ministry of Education and Research (Germany; grant 031L0010B) and the LOEWE program of the State of Hesse (Germany).
FOOTNOTES
- Received April 8, 2020.
- Accepted July 17, 2020.
- Copyright © 2020 Casas-Pastor et al.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license.