Genome and Functional Characterization of Colonization Factor Antigen I- and CS6-Encoding Heat-Stable Enterotoxin-Only Enterotoxigenic Escherichia coli Reveals Lineage and Geographic Variation

Comparative genomics and functional characterization were used to analyze a global collection of CFA/I and CS6 ST-only ETEC isolates associated with human diarrhea, demonstrating differences in the genomic content of CFA/I and CS6 isolates related to CF type, lineage, and geographic location of isolation and also lineage-related differences in ST production. Complete genome sequencing of selected CFA/I and CS6 isolates enabled descriptions of a highly conserved ST-positive (ST+) CFA/I plasmid and of at least five diverse ST and/or CS6 plasmids among the CS6 ETEC isolates. There is currently no approved vaccine for ST-only ETEC, or for any ETEC for that matter, and as such, the current report provides functional verification of ST and CF production and antimicrobial susceptibility testing and an in-depth genomic characterization of a collection of isolates that could serve as representatives of CFA/I- or CS6-encoding ST-only ETEC strains for future studies of ETEC pathogenesis, vaccine studies, and/or clinical trials.

KEYWORDS Escherichia coli, comparative genomics, heat-stable toxin

Enterotoxigenic Escherichia coli (ETEC) is a leading cause of severe diarrheal illness each year among children under 5 years of age (1) and is also a leading cause of traveler's diarrhea among adults (2,3). ETEC isolates are characterized by the heat-labile enterotoxin (LT) and/or the heat-stable enterotoxin (ST) (3–5). The human ST (STh) variant is the most prevalent ST toxin associated with human diarrhea, while the porcine ST (STp) variant was originally identified in ETEC associated with porcine diarrhea and is more prevalent among ETEC isolates from animals (2,6). ETEC colonization factors (CFs) also play an important role in the ability of ETEC to cause disease by facilitating adherence to the intestinal epithelium (3,7). At least 27 CFs have been functionally described to date (7,8); however, the most prevalent CFs are colonization factor antigen I (CFA/I) and CS1 to CS6 (2,3,7,9–11).
The Global Enteric Multicenter Study (GEMS), a large-scale prospective case-control study investigating the causes of childhood diarrhea in countries of Africa and Asia (12), identified ETEC as one of the top four leading causes of moderate-to-severe diarrhea (MSD) in children under 5 years of age (1). A critical finding of the GEMS investigation was that ST-encoding ETEC isolates (with or without the copresence of LT) were significantly associated with MSD whereas ETEC isolates that encoded only LT were not associated with MSD (1,67). These findings corroborate the epidemiological significance of diarrhea associated with ST-encoding ETEC isolates, which have been considered a public health concern since their initial description in the 1970s (13).
Although ST-only ETEC strains are a significant global childhood health concern, there is currently no approved vaccine for this important diarrheal pathogen, and previous controlled human infection model (CHIM) studies performed with ETEC utilized only a limited number of isolates (14–19), most of which were selected based on phenotypic data without the interrogation of genomic information. Thus, in the current study we used comparative genomics and functional characterization to examine the diversity of ST-only ETEC isolates, focusing on isolates with CFA/I or CS6, as these are two of the most prevalent CF types historically associated with human diarrheal illness and were found to be similarly prevalent among cases in GEMS (2,3,7,9–11,67). We characterized the genomes of 269 ST-only ETEC isolates from two well-described and geographically diverse ETEC collections, including 162 CFA/I-encoding ST-only ETEC isolates and 107 CS6-encoding ST-only ETEC isolates, here referred to as CFA/I ETEC and CS6 ETEC, respectively. Also, we used long-read sequencing to complete the genome assemblies of 20 CS6 ETEC isolates and 6 CFA/I ETEC isolates, to provide additional insight into the unique genomic content, including ST- and/or CF-encoding virulence plasmids, of representative CS6 ETEC and CFA/I ETEC isolates associated with human diarrheal illness.
[...] CFA/I- or CS6-encoding ETEC isolates, which represent two of the most dominant CF types identified among the ETEC isolates in GEMS and other studies (2,3,7,9–11).
Laboratory-based prescreening of the ETEC isolates led us to select and examine the genome contents of 269 unique ETEC isolates that encode either CFA/I or CS6 (162 CFA/I and 107 CS6 isolates) (see Table S1 in the supplemental material). The 269 CFA/I and CS6 genomes had sizes of 4.7 to 5.7 Mb and GC content of 50.09% to 50.97% (Table S1), which is consistent with previously sequenced ETEC genomes (20,21). The CFA/I and CS6 genomes had 30 different predicted multilocus sequence types (MLST). However, 60% (162/269) of the ETEC genomes were one of two MLST sequence types (ST2332 and ST443), while 17 sequence types were represented by only a single genome (Table S1). The CFA/I and CS6 genomes were represented by 43 different serotypes (Table S1). As with the MLST results, eight serotypes were dominant (O128ac:H45, O115:H5, O114:H45, O128ac:H12, O71:H45, O148:H28, ONT:H45, and O114:H5) and represented 74% (199/269) of the genomes, while 26 of the serotypes were represented by a single genome (Table S1). Previous ETEC genome assemblies have contained as many as six plasmids in a single isolate (20,21); therefore, it was not surprising that the number of predicted replicon types identified in each of the genomes ranged from 0 to as many as 8 (Table S1). The most prevalent plasmid replicons were IncFIB(AP001918) in 66% (177/269), IncFII(AY458016) in 32% (85/269), and IncFII(pCoo) in 21% (56/269) of the genomes (Table S1). The prevalence of IncFIB and IncFII plasmids is consistent with previous studies that have reported the association of E. coli virulence genes with these plasmid types (20,22).
ST production levels differ by lineage but not by CF type. The presence of genes encoding ST among the ETEC isolates was confirmed via PCR and in silico analysis of their genome assemblies; however, we wanted to examine whether there is variability in the functional production of the ST toxin by selected CFA/I and CS6 isolates. We examined 35 CFA/I and 19 CS6 isolates for their ability to produce and secrete ST into culture medium using chemically defined 4AA medium (23). ST binds to the intestinal guanylate cyclase C receptor, which is expressed on the human colonic cell line T84, and stimulates the buildup of intracellular cyclic GMP (cGMP) as previously described (24). A range of ST-induced cGMP accumulation was observed from the CFA/I and CS6 isolate supernatants, suggesting that some isolates did not make significant amounts of ST while others made robust amounts of ST under the conditions examined (Fig. 2). Two of the ETEC isolates (a86 and 702052) had no detectable ST production and did not contain an STh or STp gene in their genome assemblies, suggesting that the ST-encoding plasmids were lost from these isolates. There were no significant differences with respect to the amount of ST produced by CFA/I isolates compared with CS6 isolates (Fig. 2). Also, there were no observed lineage-specific differences in ST production among the CFA/I isolates; however, the CS6 isolates exhibited lineage-specific differences in ST production (Fig. 2). The CS6 ETEC of lineage L8 produced more ST than the CS6 ETEC of lineage L5 (P < 0.001) (Fig. 2).
CFA/I and CS6 ETEC genomes contain CF-, phylogroup-, and lineage-specific genes. To determine whether there were any genes associated with particular lineages of CS6 ETEC or CFA/I ETEC, we used a gene-based approach to identify their shared and unique gene content. After excluding genomes that had LT genes or were missing the genes encoding ST, CFA/I, and CS6, a total of 142 CFA/I genomes and 87 CS6 genomes remained for further analysis. We compared these genomes to each other as well as to a diverse collection of 37 ETEC reference genomes representing other CF types, which carry the genes for LT and/or ST (Table 1; see also Table S1). There were no genes in addition to the CS6-encoding genes that were present in all of the CS6 ETEC isolates and absent from the CFA/I ETEC isolates and only one gene in addition to the CFA/I genes that was present in all of the CFA/I genomes and absent from the CS6 genomes (Table 1; see also Table S2A and C). The additional gene that was unique to the CFA/I ETEC is identical to a region of previously sequenced ETEC isolate H10407 plasmid p948 that encodes CFA/I (GenBank accession no. FN649418.1).
Figure legend (partial): The filled symbols indicate the genes that were identified by an initial PCR screen and also in the genome assembly, while an open symbol indicates genes that were detected by PCR but absent from the genome assembly. The CS6 ETEC isolates that were subjected to additional sequencing to generate complete genomes are indicated with a green rectangle around the isolate label, while the CFA/I ETEC isolates are indicated with a blue rectangle around the isolate label. The E. coli phylogroups are designated by letters (A, B1, B2, D, E, and F), while the previously described ETEC phylogenomic lineages are indicated by the designations L1 to L21 (with the exception of L14, for which we could not obtain a high-quality assembly for the references) (10).
The number of genes that were shared among the CFA/I or CS6 genomes increased for genomes of the same phylogroup or lineage, demonstrating that there were a greater number of phylogroup- and lineage-specific genes than genes associated with CF type (Table 1). The number of lineage-specific genes that were identified in all genomes of one lineage and absent from other genomes of the same CF type ranged from 50 to 136 among the three dominant CS6 lineages (L4, L5, and L8), and from 60 to 78 among the dominant CFA/I lineages (L3, L6, and L15) (Table 1). These findings demonstrate that certain lineages of CFA/I or CS6 ETEC had a greater number of lineage-specific genes. The genes that were conserved at the phylogroup level among the CFA/I or CS6 ETEC isolates included genes associated with a type II secretion system (T2SS) and genes with predicted functions involved in metabolism, while the genes that were unique to particular lineages included genes associated with metabolism and also mobile-element-associated genes, especially phage-associated genes (Table S2).
CF-associated distribution of toxins and other virulence genes among the CFA/I and CS6 ETEC isolates. In silico detection of the ST and LT genes in each of the PCR-based presumptive ST-only ETEC genomes verified that 89% (239/269) of the genomes had only the ST gene and not the LT genes, whereas four genomes had the genes for both LT and STh (Table S1). Although all of the ETEC isolates included in this study were PCR positive (PCR+) for the ST gene, 9% (26/269) of the isolates were missing this gene from their genome assemblies (Table S1). There were 18 presumptive CS6 ETEC genomes that were missing the genes that encode CS6, with 61% (11/18) of these genomes also missing ST, and 19 genomes were missing the genes that encode CFA/I, with 74% (14/19) of these genomes also missing ST (Table S1). The gene encoding ST and the CS6 and CFA/I genes typically occur on plasmids that in some instances have demonstrated instability (20–22,25–27). Thus, it is possible that these ETEC isolates had previously carried an ST-encoding and/or CS6- or CFA/I-encoding plasmid that was lost during laboratory passage. Identification of the previously described ST gene alleles (28) in each of the ETEC genomes demonstrated that the estA2 allele was present in all but three of the CFA/I ETEC isolates whereas the CS6 ETEC genomes carried estA3, estA4, estA5, or estA7 alleles (see Fig. S1 in the supplemental material). Interestingly, the estA2 allele was also identified in five CS6 ETEC isolates, and all of these ETEC isolates were present in an undesignated ETEC lineage (Fig. 1; see also [...]).
There were two or more CFs identified in 88% (236/269) of the genomes, with 90% (146/162) of the CFA/I isolates and 84% (90/107) of the CS6 isolates carrying additional CFs (Table S1). Interestingly, CS21 (29,30) was identified in 88% (142/162) of the CFA/I genomes, compared with only 29% (31/107) of the CS6 genomes (P value of <0.001) (Table 2).
The genes encoding CS5 were identified in 53% (57/107) of the CS6 genomes and in none of the CFA/I genomes (P value of <0.001) (Table 2). Minor CFs (CS2, CS3, CS4, CS14, and CS22) were identified in ≤5 of the CFA/I and CS6 ETEC genomes (Table 2). Additional virulence genes were also detected that encode predicted proteins involved in adhesion to the host surface, including genes encoding the autotransporters [...] (Fig. 3; see also Text S1 in the supplemental material).
Comparison of complete genomes reveals geographic variation among CFA/I and CS6 ETEC isolates. Based on epidemiological data and laboratory-based characterizations, we selected 26 ST-only ETEC isolates for complete genome sequencing to provide additional insight into the diversity of plasmids and other genomic regions in these isolates, as well as to further inform the selection of candidate challenge strains for use in human volunteer challenges (Table S3). These ETEC isolates met the following selection criteria, making them potential candidates as future challenge strains: (i) they were associated with moderate to severe diarrhea in humans; (ii) they encoded only ST and not LT; (iii) they encoded CS6 or CFA/I; (iv) they were not of serogroup O39, O71, O78, or O141, which are represented by current whole-cell ETEC vaccine candidates that are in advanced clinical development (35–38); and (v) they were susceptible to a panel of eight commonly used antibiotics (azithromycin, ampicillin/sulbactam, cefazolin, ceftriaxone, ciprofloxacin, levofloxacin, tetracycline, and trimethoprim-sulfamethoxazole) (Tables S3 and S4). The CFA/I and CS6 isolates that qualified for additional genome sequencing included six CFA/I and 20 CS6 isolates, which were isolated between 1974 and 2012 in eight different countries (Table S3). These isolates represented 11 MLST sequence types and 11 serotypes and belonged to seven of the ETEC phylogenomic lineages (Table S3).
Western blot analysis verified the production of CFA/I and CS6 by these isolates, while the hemagglutination assay verified the activity of CFA/I (Table S3).
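The five selection criteria listed above amount to a conjunctive filter over isolate metadata. The following is a minimal sketch of that filter, assuming a hypothetical `Isolate` record; the field names and the example isolates are illustrative and do not come from Table S3.

```python
from dataclasses import dataclass, field

# Serogroups excluded under criterion (iv) and antibiotic panel from criterion (v).
EXCLUDED_SEROGROUPS = {"O39", "O71", "O78", "O141"}
ANTIBIOTIC_PANEL = {
    "azithromycin", "ampicillin/sulbactam", "cefazolin", "ceftriaxone",
    "ciprofloxacin", "levofloxacin", "tetracycline",
    "trimethoprim-sulfamethoxazole",
}

@dataclass
class Isolate:
    # Hypothetical record layout, used only for illustration.
    name: str
    from_msd_case: bool            # (i) associated with moderate to severe diarrhea
    has_st: bool                   # (ii) encodes ST ...
    has_lt: bool                   # ... and not LT
    cf_types: set                  # (iii) colonization factors detected
    serogroup: str                 # (iv) O serogroup, e.g. "O128ac"
    resistant_to: set = field(default_factory=set)  # (v) resistances observed

def is_candidate(iso: Isolate) -> bool:
    """Return True only if the isolate satisfies all five criteria."""
    return (
        iso.from_msd_case
        and iso.has_st and not iso.has_lt
        and bool(iso.cf_types & {"CFA/I", "CS6"})
        and iso.serogroup not in EXCLUDED_SEROGROUPS
        and not (iso.resistant_to & ANTIBIOTIC_PANEL)
    )

# Made-up isolates: the second fails criterion (iv), the third fails (v).
examples = [
    Isolate("isolate_A", True, True, False, {"CS6", "CS5"}, "O115"),
    Isolate("isolate_B", True, True, False, {"CFA/I"}, "O78"),
    Isolate("isolate_C", True, True, False, {"CFA/I"}, "O128ac", {"tetracycline"}),
]
candidates = [iso.name for iso in examples if is_candidate(iso)]
```

Resistance to an antibiotic outside the panel does not disqualify an isolate in this sketch, which is one possible reading of criterion (v).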
Comparison of a representative complete genome from each of the three dominant CS6 phylogenomic lineages and the three dominant CFA/I lineages demonstrated that these genomes have plasmid and chromosomal regions that exhibit lineage and geographic specificity (Table S5A to F). There were multiple genome regions identified in CFA/I isolate 11573 a-1 from lineage L15 that were absent from the genomes of isolates from other CFA/I lineages and in some cases were also missing from isolates belonging to the same lineage that were from different geographic locations (Fig. 4; see also Table S5C). One of the genome regions that was present in the lineage L15 genomes from Chile (11573 a-1, 10754 a-1, and 10802 a) but was absent from, or divergent in, the representative lineage L15 genomes from Mozambique (300252 and 320116), India (500469), Bangladesh (600609), and Pakistan (700384 and 710903) consisted of genes involved in O-antigen biosynthesis (EC11573a1_358 to EC11573a1_370) (Table S5C). The three lineage L15 CFA/I ETEC isolates from Chile (11573 a-1, 10754 a-1, and 10802 a) [...] among the L15 genomes but absent from the representative CFA/I genomes of L3 and L6, which included putative genes involved in flagellum biosynthesis (EC11573a1_2179 to EC11573a1_2218) (Fig. 4; see also Table S5C).
Distribution of a conserved CFA/I-encoding plasmid and multiple unique CS6-encoding plasmids. The CFA/I and STh genes were colocated on the same plasmid in all six of the complete CFA/I genomes (Table S3). These plasmids ranged in size from 88.8 to 101.6 kb, had the IncFII(AY458016) replicon, and also carried the eatA gene (Table S3), which encodes the serine protease autotransporter EatA (31). In silico detection of the STh-, CFA/I-, and EatA-encoding plasmid p11573a1_92 from ETEC isolate 11573 a-1 demonstrated that this plasmid was highly conserved among all of the CFA/I ETEC isolates examined in this study (Fig. 5).
The CFA/I ETEC genomes also contained an IncFIB plasmid that ranged in size from 46.6 to 155.8 kb and carried genes encoding CS21 (29,30) (Table S3). Interestingly, the CS21 genes were identified in 88% (142/162) of the CFA/I genomes compared to only 29% (31/107) of the CS6 genomes (P value of <0.001) (Table 2). The genes of CS21-encoding plasmid p11573a1_46 from ETEC isolate 11573 a-1 were identified in nearly all of the L6 and L15 CFA/I ETEC genomes; however, a region of the CS21 plasmid with approximately 17 genes, encoding mostly hypothetical proteins, was absent from the L3 CFA/I genomes and also from the CS6 genomes that encode CS21 (Fig. S3A).
In contrast to the conserved CFA/I+ STh plasmid that was identified, three unique ST+ CS6 plasmids were identified among the CS6 ETEC genomes (Table S3) [...] PCR-verified CS6 ETEC isolates (600468 and 720632) were missing the CS6 genes from their complete genome assemblies, but each had an STh-encoding plasmid (Table S3). The three unique plasmids that encoded both ST and CS6 also exhibited lineage specificity, with one ST+ CS6 plasmid detected only in the lineage L5 CS6 ETEC genome (Fig. S3B), and a second ST+ CS6 plasmid in the CS6 ETEC genomes of lineages L4 and L8 (Fig. S3C). The third ST+ CS6 plasmid encoded STp rather than STh and was identified only in ETEC isolate 214-4 (Fig. S3D). Interestingly, the four complete genomes that had STh and CS6 genes on two separate plasmids were identified in a single undesignated lineage of phylogroup A (Fig. 1; see also Table S3). In silico detection of the STh (p503046_85) and CS6 (p503046_80) plasmids demonstrated that both of these plasmids were present in all five of the ETEC genomes of this lineage (503046, 702582, 503458, 520873, and 510016) (Fig. S3E and F). These plasmids were not present in any of the other ETEC genomes analyzed (Fig. S3E and F), demonstrating that two unique plasmids were involved in the acquisition of STh and CS6 by ETEC isolates of this novel ETEC lineage. Identification of the ST genes among the genomes of this lineage demonstrated that the ST plasmid of these CS6 ETEC genomes contained the estA2 allele, which is typically carried by the CFA/I ETEC (Fig. S1).

DISCUSSION
Previous studies, including the case-control GEMS, demonstrated that ST-only ETEC strains are among the leading causes of severe diarrheal illness among children and are more often associated with severe illness than ETEC strains that encode only LT (2,7,9,67). Thus, in the current study we investigated whether there are genomic or phenotypic differences among the dominant CF types (CS6 and CFA/I) of the ST-only ETEC strains. Phylogenomic analysis demonstrated that a majority of the CFA/I ETEC and CS6 ETEC strains occur in six distinct lineages, although they were identified in up to 13 previously described ETEC lineages in all, as well as additional undefined lineages, revealing that genomically diverse E. coli strains have acquired the genes encoding ST and either CFA/I or CS6. Previous comparative genomics studies have demonstrated an association of particular toxins and CFs with different lineages of ETEC (10,20,21,39–41). Similarly, we observed an association of ST and certain CFs with the previously designated ETEC lineages; however, we also determined that a number of noncanonical ETEC virulence factors, including autotransporters and secretion systems, exhibited lineage specificity. In some cases, the noncanonical virulence genes exhibited a greater association with their dominant CF type (CFA/I or CS6) than with their lineage, suggesting that certain noncanonical virulence genes are colocated with the CF genes on plasmids or other mobile elements. Interestingly, gene-based comparisons of the CFA/I and CS6 ETEC isolates identified phylogroup- and lineage-specific genes but also demonstrated that there was geographic specificity in the genome content among isolates belonging to the same lineage. Many of the variable regions in the CFA/I and CS6 ETEC genomes contained genes associated with phage or transposable elements, highlighting the role of mobile elements in the ongoing diversification of the CFA/I and CS6 ETEC strains (and most likely all ETEC strains).
By generating complete genome sequences of selected CFA/I and CS6 ETEC isolates, we were also able to describe plasmids that encode ST and CS6 or CFA/I. Interestingly, the STh- and CFA/I-encoding plasmids were highly conserved among the CFA/I ETEC isolates analyzed in this study, suggesting that the CFA/I ST-only ETEC lineages most likely arose by the acquisition of this conserved virulence plasmid by multiple genomically diverse E. coli lineages. In contrast, the completed CS6 ETEC genomes have several unique ST- and/or CS6-encoding plasmids, which have been acquired by multiple genomically diverse E. coli lineages. Interestingly, functional characterization demonstrated that CS6 ETEC isolates of different lineages that have unique virulence plasmids also exhibited significant differences in their ST production. Further investigation is necessary to determine whether plasmid or chromosomal genes are contributing to differences in ST production and, if so, whether this results in differences in illness severity associated with these ST-only ETEC isolates.
In summary, our findings demonstrate that while the majority of the CFA/I ST-only ETEC and CS6 ST-only ETEC analyzed were present in a limited number of dominant lineages, the genes encoding ST, CFA/I, and CS6 had been acquired by genomically diverse ETEC by the dissemination of a highly conserved CFA/I-encoding plasmid and several different versions of a CS6-encoding plasmid. Furthermore, variation was identified in the genome content of the CFA/I ETEC and CS6 ETEC isolates that was associated with geographic location of isolation, phylogroup, or lineage, demonstrating that selected populations of ST-only ETEC strains have undergone additional diversification following the acquisition of the ST and CF genes. There is currently no approved vaccine for disease caused by ST-only ETEC, or by any ETEC strain for that matter, and as such, the current report provides functional verification of ST and CF production, antimicrobial susceptibility testing data, and an in-depth genomic characterization of isolates that could serve as representatives of CFA/I- or CS6-encoding ST-only ETEC strains for future studies of ETEC pathogenesis, vaccine studies, and/or clinical trials. These isolates will be further investigated for differences in gene content that influence ST production and are planned to be developed as potential challenge isolates for use in evaluating future vaccine candidates.
Serogroups. The O antigen was determined as described previously by Guinée et al. (47) using antisera that identify O antigen serogroups O1 to O185. Isolates that did not react with O antisera were classified as nontypeable (ONT). All antisera were obtained and adsorbed with the corresponding cross-reacting antigens to remove nonspecific agglutinins.
Production and activity of CFA/I and CS6. Whole-cell lysates were prepared from ETEC isolates grown on CFA agar (CFA/I ETEC) or in lysogeny broth (LB) (CS6 ETEC), normalized according to optical density at 600 nm (OD600), and mixed 1:1 with 2× Laemmli buffer. Samples were electrophoresed by 15% SDS-PAGE, and proteins were transferred to polyvinylidene difluoride (PVDF) membranes (Millipore Corp., Bedford, MA). The membranes were probed with rabbit anti-CFA/I or anti-CS6 antibody (Rockland, Limerick, PA). Western immunoblots were developed using an Odyssey system (Li-Cor Biosciences, Lincoln, NE). Positive controls included purified protein samples of CFA/I or CS6 (BEI Resources, Manassas, VA).
The ability of CFA/I-expressing ETEC to hemagglutinate human type A red blood cells (RBC) was assessed. Duplicate samples of ETEC isolates grown on CFA agar were resuspended to an OD600 of 2.0 and serially diluted 2-fold in phosphate-buffered saline (PBS) in a 96-well plate. An equal volume of washed human type A RBC was added to each well. Equal volumes of 0.1 M D-(+)-mannose–0.15 M NaCl were added to all wells. Plates were incubated for 2 h at 4°C. The hemagglutinin (HA) titer of each isolate was read as the dilution at which the RBC pellet did not form at the bottom of the well.
ST production. Selected ETEC isolates were grown overnight in LB and were used to inoculate chemically defined 4AA medium at a 1:100 dilution and were incubated overnight at 37°C and 250 rpm. 4AA medium has been used successfully for ST expression and subsequent purification (23,48). The following morning, the culture OD600 was recorded, 1 ml of each culture was centrifuged at 13,000 rpm for 10 min, and 800 µl of supernatant was immediately divided into aliquots, placed into 2.0-ml glass screw vials, and frozen at −20°C until assayed for ST activity via the cGMP assay. Human T84 colonic epithelial cells were purchased from the American Type Culture Collection (ATCC) (catalog no. CCL-248) and were cultured in ATCC's 1:1 Dulbecco's modified Eagle's medium and Ham's nutrient mixture F-12 (DMEM-F-12; Gibco catalog no. 11320033) containing 2.5 mM L-glutamine, 15 mM HEPES, and 0.5 mM sodium pyruvate and supplemented with 5% fetal bovine serum (FBS). All cell cultures were supplemented with antibiotic-antimycotic (Gibco). Confluent T84 cells were harvested from T-75 culture flasks using 0.25% trypsin and resuspended in DMEM-F-12 medium. T84 cells were seeded into 24-well, flat-bottom cell culture plates (Corning Costar, Cambridge, MA) at a density of 5 × 10^5 cells per well and grown to confluence. Intracellular cGMP levels were determined as previously described (24). The amount of ST produced by the ETEC isolates was calculated relative to the amount of cGMP produced by 10 ng of purified ST (the positive control). Statistical differences in the mean levels of ST production by ETEC isolates associated with the colonization factor type (CFA/I or CS6) or from different lineages were determined with R v.3.4.1 using the F test of variance and the two-sample t test.
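The normalization and group comparison described above can be sketched in a few lines. This is an illustration only, not the authors' R code: the function names are hypothetical, the pooled-variance form of the t test is an assumption (the text does not say which form was applied after the F test), and P values are omitted because they require distribution functions outside the Python standard library.

```python
from math import sqrt
from statistics import mean, variance

def relative_st(cgmp_sample, cgmp_control):
    """Express a sample's cGMP signal as a fraction of the purified ST control signal."""
    if cgmp_control <= 0:
        raise ValueError("control cGMP must be positive")
    return cgmp_sample / cgmp_control

def f_ratio(a, b):
    """Ratio of the larger to the smaller sample variance (F statistic for equal variances)."""
    va, vb = variance(a), variance(b)
    return max(va, vb) / min(va, vb)

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance; returns (t, degrees of freedom)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df
```

In use, each isolate's supernatant reading would first pass through `relative_st`, and the resulting values would then be grouped by CF type or lineage before calling `f_ratio` and `pooled_t`.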
Genome sequencing and assembly. Genomic DNA of each ETEC isolate was extracted from overnight cultures using a Sigma GenElute bacterial genomic DNA kit (Sigma-Aldrich; St. Louis, MO). The genomes were sequenced using paired-end 500-bp insertion libraries and an Illumina HiSeq 4000 system. The 150-bp Illumina reads were assembled using SPAdes v.3.7.1 (49), and the final assemblies were filtered to contain only contigs that were ≥500 bp in length and had ≥5× k-mer coverage. The assembly metrics are provided in Table S1 in the supplemental material. Additional long-read genome sequencing was performed on a Pacific Biosciences RS II platform (PacBio) as previously described (50,51). The characteristics of the complete assemblies are listed in Table S3.
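The contig filter described above (length of at least 500 bp and k-mer coverage of at least 5×) could be applied as in the sketch below. It assumes the usual SPAdes contig header layout, `NODE_<n>_length_<bp>_cov_<coverage>`, and measures length from the sequence itself; the function names are illustrative and this is not the authors' pipeline.

```python
import re

# Assumed SPAdes-style header, e.g. "NODE_1_length_600_cov_12.5".
COV_RE = re.compile(r"_cov_([0-9.]+)")

def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def filter_contigs(text, min_len=500, min_cov=5.0):
    """Keep contigs whose length and header-reported k-mer coverage meet both thresholds."""
    kept = []
    for header, seq in parse_fasta(text):
        match = COV_RE.search(header)
        if match is None:
            continue  # no coverage in header, so the contig cannot be evaluated
        if len(seq) >= min_len and float(match.group(1)) >= min_cov:
            kept.append((header, seq))
    return kept
```

Contigs without a parsable coverage field are dropped here; a stricter variant might raise an error instead.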
In silico multilocus sequence typing, serotyping, and detection of antibiotic resistance genes. The seven genomically conserved housekeeping loci (adk, gyrB, fumC, icd, mdh, purA, and recA) of the multilocus sequence typing (MLST) scheme previously developed by Wirth et al. (52) were identified in each of the genomes listed in Table S1 as previously described (51). These genes are used to examine the population structures of the compared E. coli isolates. The serotypes were predicted using Serotype Finder v. 1.1 (https://cge.cbs.dtu.dk/services/SerotypeFinder/) (53). Antibiotic resistance genes were identified in each of the ETEC genomes using resistance gene identifier (RGI) v.3.2.0 of the comprehensive antibiotic resistance database (CARD) (54) as previously described (50,51).
Phylogenomic analysis. The 269 CFA/I and CS6 ETEC genomes analyzed in this study were compared with 61 previously sequenced ETEC reference genomes (Table S1) and 31 diverse E. coli and Shigella genomes (55) using a single nucleotide polymorphism (SNP)-based approach as previously described (56,57). There were 204,335 conserved SNP sites among these genomes relative to the reference E. coli IAI39 genome (GenBank accession no. NC_011750.1). The concatenated SNP sites were used to infer a maximum likelihood phylogeny with RAxML v7.2.8 (58), using the GTR model of nucleotide substitution, the GAMMA model of rate heterogeneity, and 100 bootstrap replicates. The phylogeny was labeled using interactive Tree Of Life software (iTOL v.3) (59).
Gene-based comparisons. Differences in gene content among the CS6 ETEC and CFA/I ETEC isolates were identified using BLASTN large-scale BLAST score ratio (LS-BSR) analysis as previously described (60,61). The protein-coding genes of each genome were assigned to gene clusters with ≥90% nucleotide identity and ≥90% alignment length using CD-HIT v. 4.6.7 (62) (see Data Set S1 in the supplemental material). Gene clusters identified with a BSR of ≥0.9 were considered to represent significant similarity, while gene clusters with a BSR of <0.4 were considered absent.
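The BSR thresholds above lend themselves to a short illustration. The sketch below assumes the standard definition of a BLAST score ratio (the bit score of a gene against a genome divided by the bit score of the gene against itself) and a hypothetical nested dictionary of precomputed BSR values; the "divergent" label for intermediate values and all function names are assumptions made for illustration.

```python
def bsr(query_bits, self_bits):
    """BLAST score ratio: alignment bit score normalized by the gene's self-hit score."""
    if self_bits <= 0:
        raise ValueError("self bit score must be positive")
    return query_bits / self_bits

def classify(value, present=0.9, absent=0.4):
    """Apply the thresholds from the text; values in between are labeled 'divergent' here."""
    if value >= present:
        return "present"
    if value < absent:
        return "absent"
    return "divergent"

def group_specific(matrix, group, others, present=0.9, absent=0.4):
    """Clusters present in every genome of `group` and absent from every genome in `others`.

    `matrix` maps cluster id -> {genome id -> BSR}; missing genomes count as BSR 0.
    """
    return sorted(
        cluster for cluster, row in matrix.items()
        if all(row.get(g, 0.0) >= present for g in group)
        and all(row.get(g, 0.0) < absent for g in others)
    )

# Made-up BSR values for three clusters across four genomes.
matrix = {
    "cluster_1": {"g1": 1.0, "g2": 0.95, "g3": 0.1, "g4": 0.0},
    "cluster_2": {"g1": 1.0, "g2": 0.97, "g3": 0.92, "g4": 0.99},
    "cluster_3": {"g1": 0.98, "g2": 0.6, "g3": 0.0, "g4": 0.0},
}
specific = group_specific(matrix, group=["g1", "g2"], others=["g3", "g4"])
```

The same set logic would apply whether the groups are defined by CF type, phylogroup, or lineage.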
In silico detection of E. coli virulence genes and plasmids. E. coli and Shigella virulence genes were identified in the ETEC genomes also using BLASTN LS-BSR as previously described (60,61). The association of virulence genes among the CFA/I ETEC and CS6 ETEC genomes was analyzed for statistical significance using Pearson's chi-square test with Yates' continuity correction or Fisher's exact test using R v.3.4.1. The clustered heat maps were generated using the heatmap.2 function of gplots v. 3.0.1 in R v.3.3.2 and the complete linkage method with Euclidean distance estimation. Plasmid incompatibility types in the PlasmidFinder v.1.3 database (63) were identified in each of the ETEC genomes using BLASTN LS-BSR (60,61). Plasmids in each of the complete genomes were annotated using an in-house annotation pipeline (64,65). The predicted protein-coding genes of selected plasmids were detected in each of the ETEC genomes using BLASTN LS-BSR and were visualized as a clustered heat map as described above.
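For a 2×2 table, Pearson's chi-square statistic with Yates' continuity correction has a closed form, and with one degree of freedom the P value can be obtained from the complementary error function. The sketch below applies it to the CS21 counts reported in this study (142 of 162 CFA/I genomes versus 31 of 107 CS6 genomes). It is a standard-library illustration rather than the authors' R code, and the function name is hypothetical.

```python
from math import erfc, sqrt

def yates_chi2(a, b, c, d):
    """Chi-square with Yates' correction for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, P value); the P value uses the identity
    P(X > x) = erfc(sqrt(x / 2)) for a chi-square variable with 1 degree of freedom.
    """
    n = a + b + c + d
    corrected = max(abs(a * d - b * c) - n / 2, 0.0)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = n * corrected ** 2 / denom
    return chi2, erfc(sqrt(chi2 / 2))

# Rows: CFA/I and CS6 genomes; columns: CS21 detected and not detected.
chi2, p = yates_chi2(142, 162 - 142, 31, 107 - 31)
```

For tables with small expected counts, Fisher's exact test, which the text names as the alternative, would be the more appropriate choice.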
The sequences of the ST genes from each ETEC genome were compared with previously described estA reference sequences (28). The estA nucleotide sequences were aligned using ClustalW, and a phylogeny was constructed using the maximum likelihood method with the Kimura 2-parameter model and 1,000 bootstraps using MEGA7 (66), and the results were labeled using iTOL (59).
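The Kimura 2-parameter model distinguishes transitions (A↔G, C↔T) from transversions. As a minimal sketch of what the model measures, the pairwise K2P distance between two aligned sequences is d = −½ ln[(1 − 2P − Q)·√(1 − 2Q)], where P and Q are the proportions of transitional and transversional differences. This shows only the distance formula, not the maximum likelihood tree inference performed in MEGA7, and the function name is illustrative.

```python
from math import log, sqrt

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned nucleotide sequences.

    Columns containing gaps or ambiguous bases are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        sites += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    return -0.5 * log((1 - 2 * p - q) * sqrt(1 - 2 * q))
```

The formula is undefined for highly diverged sequence pairs (when the argument of the logarithm is not positive), which is unlikely to matter for closely related estA alleles but would need handling in general use.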
Data availability. The ETEC genome assemblies were deposited in GenBank under the accession numbers listed in Table S1.