High-Throughput Metabolic Network Analysis and Metatranscriptomics of a Cosmopolitan and Streamlined Freshwater Lineage

An explosion in the number of available genome sequences obtained through metagenomics and single-cell genomics has enabled a new view of the diversity of microbial life, yet we know surprisingly little about how microbes interact with each other or their environment. In fact, the majority of microbial species remain uncultivated, while our perception of their ecological niches is based on reconstruction of their metabolic potential. In this work, we demonstrate how the “seed set framework”, which computes the set of compounds that an organism must acquire from its environment (Borenstein et al., 2008), enables high-throughput, computational analysis of metabolic reconstructions, while providing new insights into a microbe’s metabolic capabilities, such as nutrient use and auxotrophies. We apply this framework to members of the ubiquitous freshwater Actinobacterial lineage acI, confirming and extending previous experimental and genomic observations implying that acI bacteria are heterotrophs reliant on peptides and saccharides. We also present the first metatranscriptomic study of the acI lineage, featuring high expression of transport proteins and the light-harvesting protein actinorhodopsin, and confirming predictions of nutrients and essential metabolites while providing additional support to the hypothesis that members of the acI are photoheterotrophs.

All computational steps were implemented using Python scripts, freely available as part of the reverseEcology Python package developed for this project (https://pypi.python.org/pypi/reverseEcology/, DOI:######).

Identification of Transported Compounds
For each genome, we identified all transport reactions present in its metabolic network reconstruction. Gene-protein-reaction associations (GPRs) for these reactions were manually curated to remove unannotated proteins, group genes into operons (if applicable), and identify missing subunits for multi-subunit transporters. These genes were then mapped to their corresponding COGs and grouped accordingly. Finally, the most common annotation for each COG was used to identify likely substrates for each of these groups.
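The final step above (assigning each COG its most common annotation) can be sketched as follows. This is a minimal illustration, not the reverseEcology implementation: the function name, the input dictionaries, and the example gene identifiers are all hypothetical, and ties are resolved by whichever annotation is encountered first.

```python
from collections import Counter, defaultdict


def consensus_annotation(gene_to_cog, gene_to_annotation):
    """Return the most common annotation among the genes in each COG.

    Both arguments are hypothetical inputs: dicts keyed by gene ID.
    Genes without an annotation are ignored.
    """
    counts_by_cog = defaultdict(Counter)
    for gene, cog in gene_to_cog.items():
        annotation = gene_to_annotation.get(gene)
        if annotation:
            counts_by_cog[cog][annotation] += 1
    # most_common(1) returns a one-element list of (annotation, count).
    return {
        cog: counts.most_common(1)[0][0]
        for cog, counts in counts_by_cog.items()
    }


# Illustrative data only; gene and COG names are made up.
gene_to_cog = {"g1": "cog1", "g2": "cog1", "g3": "cog1", "g4": "cog2"}
gene_to_annotation = {
    "g1": "oligopeptide ABC transporter",
    "g2": "oligopeptide ABC transporter",
    "g3": "hypothetical transporter",
    "g4": "ammonium transporter",
}
consensus = consensus_annotation(gene_to_cog, gene_to_annotation)
```

A majority vote of this kind is only a starting point for the manual curation described above, since a frequent but uninformative annotation can outvote a specific one.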

Protein Clustering, Metatranscriptomic Mapping, and Clade-Level Gene Expression
OrthoMCL (Li et al., 2003)

Figure 1. Previous phylogenetic analyses using 16S rRNA gene sequences have revealed that the acI lineage can be grouped into three distinct monophyletic clades (Newton et al., 2011). In this study, the phylogenetic tree built from 37 concatenated marker genes also identified three monophyletic branches, enabling MAGs to be classified as clade acI-A or acI-B based on the taxonomy of SAGs within each branch. Note that three MAGs formed a monophyletic group separate from clades acI-A and acI-B; we assume these genomes belong to clade acI-C as no other acI clades have been identified to date.

Estimated Completeness of Tribe- and Clade-Level Composite Genomes
Metabolic network reconstructions created from acI SAGs and MAGs will likely be missing reactions, as the underlying genomes are incomplete (Table 1) genes (Figure 2B). As a result, seed compounds were calculated for composite clade-level genomes, with the understanding that some true seed compounds for the acI-C clade will not be predicted.

Computation and Evaluation of Potential Seed Compounds
Seed compounds were computed for each clade, using the composite metabolic network graph for that clade (Figure 3, and Figures S1 to S3). A total of 125 unique seed compounds were identified across the three clades (Table S2). Additional details are available in the Supplemental Online Material.

Seed compounds were predicted using the results of an automated annotation pipeline, and as such are likely to contain inaccuracies (e.g., due to missing or incorrect annotations). As a result, we screened the set of predicted seed compounds to identify those that represented biologically plausible auxotrophies and nutrients, and manually curated this subset to obtain a final set of auxotrophies and nutrient sources. shows the auxotrophies and nutrients these compounds represent.
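To make the seed set computation concrete, the sketch below follows the graph-based definition of Borenstein et al. (2008): compounds are nodes, each reaction contributes directed edges from its substrates to its products, and a seed set is a strongly connected component that receives no edges from any other component. It is an illustrative, standard-library-only sketch rather than the reverseEcology code; the function name and the toy network are assumptions.

```python
from collections import defaultdict


def seed_sets(edges):
    """Return the source strongly connected components of a directed graph.

    `edges` is an iterable of (substrate, product) pairs. Each returned
    frozenset is one seed set: acquiring any one of its members from the
    environment makes the others reachable.
    """
    graph, reverse, nodes = defaultdict(set), defaultdict(set), set()
    for src, dst in edges:
        graph[src].add(dst)
        reverse[dst].add(src)
        nodes.update((src, dst))

    # Kosaraju pass 1: record nodes in order of DFS completion.
    order, visited = [], set()
    for start in sorted(nodes):
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(sorted(graph[start])))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(sorted(graph[nxt]))))
                    break
            else:
                order.append(node)
                stack.pop()

    # Kosaraju pass 2: sweep the reversed graph in reverse finish order.
    component = {}
    for root in reversed(order):
        if root in component:
            continue
        component[root] = root
        stack = [root]
        while stack:
            node = stack.pop()
            for prev in reverse[node]:
                if prev not in component:
                    component[prev] = root
                    stack.append(prev)

    members = defaultdict(set)
    for node, root in component.items():
        members[root].add(node)

    # A component is a seed set if no edge enters it from another component.
    has_incoming = {
        component[dst]
        for src, dsts in graph.items()
        for dst in dsts
        if component[src] != component[dst]
    }
    return [frozenset(m) for r, m in members.items() if r not in has_incoming]


# Toy network: A and B interconvert and feed C -> D; E also feeds D.
toy_edges = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "D"), ("E", "D")]
toy_seeds = seed_sets(toy_edges)
```

In the toy network, {A, B} and {E} are the seed sets, because nothing outside those components produces them. In a real reconstruction, a missing annotation removes edges and can turn a synthesizable compound into an apparent seed, which is why the predicted seeds required the screening and curation described above.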

Making Sense of Seed Compounds via Protein Clustering and Metatranscriptomic Mapping
With regard to seed compounds representing nutrient sources, genes associated with the consumption of these compounds should be expressed. However, because seed compounds were computed from each clade's composite metabolic network graph, genes associated with the consumption of seed compounds may be present in multiple genomes within the clade. To facilitate the linkage of metatranscriptome measurements to seed compounds, we decided to map metatranscriptome samples to clusters of orthologous groups (COGs) within each clade. We used OrthoMCL (Li et al., 2003) to identify COGs in the set of acI genomes, and counted each COG as present in a clade if that COG was present in at least one genome belonging to that clade. We then used BBMap to map metatranscriptome reads to our reference genome collection, and counted the unique reads which map to each

Sequencing of cDNA from all four rRNA-depleted metatranscriptome samples yielded approximately 160 billion paired-end reads. After merging, filtering, and further in-silico rRNA removal, approximately 81 billion, or 51% of the reads remained (Table S1). OrthoMCL identified a total of 5013 protein clusters across the three clades (Table S3). The COGs were unequally distributed across the three clades, with clade acI-A genomes containing 3175 COGs (63%), clade acI-B genomes containing 3459 COGs (69%), and clade acI-C genomes containing 1365 COGs (27%). After mapping the metatranscriptomes to our acI genomes (Table S4), we identified 650 COGs expressed in clade acI-A, 785 in clade acI-B, and 849 in clade acI-C (Table S5). Among expressed genes, the median log2 average RPKM value was 10.3 in clade acI-A, 10.2 in clade acI-B, and 9.0 in clade acI-C. Thus, despite differential abundance of each clade within the lake, median gene expression within each clade was similar.
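The clade-level bookkeeping described above can be sketched as follows. The sketch assumes the conventional definition of RPKM (reads per kilobase of gene length per million mapped reads) and assumes that "log2 average RPKM" means the base-2 logarithm of the mean RPKM across samples; the text does not spell out either detail. All function names and example values are hypothetical.

```python
import math
from collections import defaultdict


def cogs_by_clade(genome_to_clade, genome_to_cogs):
    """A COG is present in a clade if at least one member genome has it."""
    clade_cogs = defaultdict(set)
    for genome, cogs in genome_to_cogs.items():
        clade_cogs[genome_to_clade[genome]].update(cogs)
    return dict(clade_cogs)


def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase per million mapped reads (conventional formula)."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)


def log2_mean_rpkm(rpkm_values):
    """log2 of the mean RPKM across samples; None if nothing was expressed."""
    mean = sum(rpkm_values) / len(rpkm_values)
    return math.log2(mean) if mean > 0 else None


# Illustrative data only; genome and COG names are made up.
presence = cogs_by_clade(
    {"genome1": "acI-A", "genome2": "acI-A", "genome3": "acI-B"},
    {"genome1": {"cog1", "cog2"}, "genome2": {"cog2", "cog3"}, "genome3": {"cog1"}},
)
example_rpkm = rpkm(read_count=100, length_bp=1000, total_mapped_reads=1_000_000)
example_log2 = log2_mean_rpkm([512.0, 1536.0])
```

Taking the union across genomes mirrors the composite-genome approach used for the seed sets: an incomplete genome cannot hide a COG as long as one other genome in the clade carries it.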

Auxotrophies and Nutrient Sources of the acI Lineage
Seed set analysis yielded seven auxotrophies that could be readily mapped to ecophysiological attributes of the acI lineage (Figure 4A). In all three clades, beta-alanine was identified as a seed compound, suggesting an auxotrophy for pantothenic acid (Vitamin B5), a precursor to coenzyme A formed from beta-alanine and pantoate. In bacteria, beta-alanine is typically synthesized via aspartate decarboxylation, and we were unable to identify a candidate gene for this enzyme (aspartate 1-decarboxylase, E.C. 4.1.1.11) in any acI genome. Pyridoxine 5'-phosphate and 5'-pyridoxamine phosphate (forms of the enzyme cofactor pyridoxal 5'-phosphate, Vitamin B6) were also predicted to be seed compounds, and numerous enzymes in the biosynthesis of these compounds were not found in the genomes.

Clades within the acI lineage also exhibited distinct auxotrophies. Clade acI-A was predicted to be auxotrophic for the cofactor tetrahydrofolate (THF or Vitamin B9), and numerous enzymes for its biosynthesis were missing. This cofactor plays an important role in the metabolism of amino acids and vitamins. In turn, clade acI-B was predicted to be auxotrophic for adenosylcobalamin (Vitamin B12), containing only a single reaction from its biosynthetic pathway. Finally, acI-C was predicted to be auxotrophic for the nucleotide uridine monophosphate (UMP, used as a monomer in RNA synthesis) and the amino acids lysine and homoserine. In all cases multiple enzymes for the biosynthesis of these compounds were not found in the acI-C genomes. However, with the exception of adenosylcobalamin, we did not identify transporters for any of these compounds.
Furthermore, because the acI-C composite genome was estimated to be around 75% complete, we cannot rule out the possibility that the missing genes might be found when additional genomes are recovered.

A number of seed compounds were predicted to be nutrients, compounds which can be degraded by members of the acI lineage (Figure 4B). Both clades acI-A and acI-B were predicted to use D-altronate and trans-4-hydroxy proline as nutrients, and acI-B was additionally predicted to use glycine betaine. These compounds indicate that the

Finally, all three clades were predicted to use di-peptides and the sugar maltose as nutrients. Clades acI-A and acI-C were also predicted to consume the polysaccharides stachyose, manninotriose, and cellobiose. In all cases, these compounds were associated with reactions catalyzed by peptidases or glycoside hydrolases (Tables S8 and S9), which may be capable of acting on compounds beyond the predicted seed compounds. Thus, we used these annotations to define nutrient sources, rather than using the predicted seed compounds themselves. Among these nutrient sources were di- and polypeptides, predicted to be released from both cytosolic and membrane-bound aminopeptidases. As discussed below, we identified a number of transport proteins capable of transporting these released residues. In Lake Mendota, these aminopeptidases were expressed in clades acI-A and acI-B at around 70% of the median gene expression levels, while they were expressed at up to twice the median in clade acI-C (Table S8). These findings agree with MAR-FISH and CARD-FISH studies that confirm the ability of acI bacteria to consume a variety of amino acids (Salcher et al., 2010, 2013).

All three clades were predicted to encode an alpha-glucosidase, which in Lake Mendota was expressed most strongly in clade acI-C, at approximately 116% of the median (Table S9).
Clades acI-A and acI-C also encode a beta-glucosidase, but it was not expressed, at least under prevailing environmental conditions. Both of these enzymes release glucose monomers, which acI is known to consume (Buck et al., 2009; Salcher et al., 2013). Furthermore, these two clades encode an alpha-galactosidase and multiple maltodextrin glucosidases (which free maltose from maltotriose), but these were only expressed in clade acI-C during our sampling period. The alpha-galactosidase had a log2 average RPKM expression value of 2.5 times the median, while the maltodextrin glucosidases were expressed at approximately 20% of the median (Table S9).
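Expression values in this section are reported relative to the clade median (for example, "116% of the median"). A minimal sketch of that normalization is given below. It assumes the median is taken over expressed COGs only, in line with the "among expressed genes" wording used earlier, and it works on whatever scale the caller supplies; whether the ratios in the text were formed on raw or log2-transformed values is not stated here. The function name and example values are hypothetical.

```python
from statistics import median


def relative_to_median(expression):
    """Express each COG's value as a multiple of the clade median.

    `expression` maps COG ID to an expression value for one clade.
    COGs with no expression (value <= 0) are excluded before the
    median is taken, and are omitted from the result.
    """
    expressed = {cog: value for cog, value in expression.items() if value > 0}
    if not expressed:
        return {}
    clade_median = median(expressed.values())
    return {cog: value / clade_median for cog, value in expressed.items()}


# Illustrative values only.
ratios = relative_to_median(
    {"cogA": 5.0, "cogB": 10.0, "cogC": 20.0, "cogD": 0.0}
)
```

Normalizing within each clade makes values comparable across clades that differ in abundance, which is the comparison the text relies on.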

Compounds Transported by the acI Lineage
Microbes may be capable of transporting compounds that are not strictly required for growth, and comparing such compounds to predicted seed compounds can provide additional information about an organism's ecology. Thus, we used the metabolic network reconstructions for the acI genomes to systematically characterize the transport capabilities of the acI lineage.

All acI clades encode and expressed a diverse array of transporters (Figure 5, Tables S10 and S11, and the Supplemental Online Material). Consistent with the presence of peptidases, all clades contain numerous genes for the transport of peptides and amino acids, including multiple oligopeptide and branched-chain amino acid transporters, as well as two distinct transporters for the polyamines spermidine and putrescine. All clades also contain a transporter for ammonium. As averaged over the 24-hour sampling period, the ammonium, branched-chain amino acid, and oligopeptide transporters had expression values above the median, with expression values for the substrate-binding protein (of the ATP-binding cassette (ABC) transporters) ranging from 2 to 325 times the median (Table S10). In contrast, while all clades expressed some genes from the polyamine transporters, only clade acI-B expressed the spermidine/putrescine binding protein, at approximately 75 times the median (Table S10). Additionally, clade acI-A contains a third distinct branched-chain amino acid transporter, composed of COGs not found in clades acI-B or acI-C. This transporter was not as highly expressed as the shared transporters, with the substrate-binding protein not expressed at all (Table S10). Finally, clades acI-A and acI-B also contain a transporter for glycine betaine, which was only expressed in clade acI-A, at approximately 35 times the median (Table S10).
However, because these observations were made at a single site at a single point in time, we cannot rule out the possibility that the expression of these transporters changes with space and time.

All clades also strongly expressed transporters consistent with the presence of glycoside hydrolases, including transporters for the sugars maltose (a dimer of glucose) and xylose, with expression values for the substrate-binding protein ranging from 3 to 144 times the median (Table S10). Clades acI-A and acI-B also contain four distinct transporters for ribose, although the substrate-binding subunit was not expressed at the time of sampling (Table S10).

Representatives from the acI lineage also encode and expressed a number of transporters that do not have corresponding seed compounds, including a uracil permease and a xanthine/uracil/thiamine/ascorbate family permease, both of which were expressed at levels ranging from 11 to 127 times the median (Table S10) (Table S10). Though not strictly annotated as such, all three of these transporters may be responsible for the uptake of the seed compound UMP. In addition, clade acI-A contains but did not express a transporter for cobalamin (Vitamin B12), and both clades acI-A and acI-B contain but did not express transporters for thiamin (Vitamin B1) and biotin (Vitamin B7) (Table S10). Despite predicted auxotrophies for Vitamins B5 and B6, we were unable to find transporters for these two compounds. However, as identification and annotation of transport proteins is an active area of research (Saier et al., 2014), transporters for these vitamins may yet be present in the genomes.

Finally, all three clades expressed actinorhodopsin, a light-sensitive protein that functions as a proton efflux pump (Sharma et al., 2008).
In all clades, actinorhodopsin was among the top seven most highly expressed genes at the time of sampling (Table S4), with expression values in excess of 300 times the median in all three clades (Table S4). Given that many of the transport proteins are ABC transporters, we speculate that

Combined, these results indicate that acI are photoheterotrophs, making a living on a diverse array of N-rich compounds, sugars, oligo- and polysaccharides, and light. We hypothesize that the acI obtain peptides from the products of cell lysis, and may participate in the turnover of high molecular weight dissolved organic compounds, such as starch, glycogen, and cellulose. The acI lineage does not appear to be metabolically self-sufficient, relying on other organisms for the production of essential nutrients.

This study also presents the first combined genomic and metatranscriptomic analysis of a freshwater microbial lineage. Transport proteins were among the most highly expressed in the acI genomes, and the expression of multiple amino acid transporters may facilitate uptake of these labile compounds. We also observed differences in the relative expression of these transporters, which may point to clade-specific differences in the affinity for these substrates. Finally, the actinorhodopsin protein was highly expressed, and may facilitate synthesis of the ATP needed to drive acI's many ABC-type transporters.

A close comparison of our predictions to previous studies of the acI lineage reveals some important limitations of the seed set framework and automatic metabolic reconstructions. First, the seed set framework only identifies compounds that the metabolic network must obtain from its environment, and will fail to identify compounds that the organism can acquire from its environment but can also synthesize itself.

TE02754.1703    glucose + ATP --> glucose-6-P + ADP + Pi
rxn00558    TE02754.1688    glucose-6-P <--> fructose-6-P
rxn00545    TE02754.2419    fructose-6-P + ATP --> fructose-1,6-P + ADP
rxn00786    TE02754.1795 OR TE02754.2367    fructose-1,6-P <--> glyceraldehyde-3-P + glycerone-P
rxn00747    TE02754.1685    glycerone-P <--> glyceraldehyde-3-P
rxn00781    TE02754.1683    glyceraldehyde-3-P + NAD + Pi <--> 1,3-P-glycerate + NADH
rxn01100    TE02754.1684    1,3-P-glycerate + ADP + Pi <--> 3-P-glycerate + ATP
rxn01106    TE02754.1899    3-P-glycerate <-->