Functional Gene Array-Based Ultrasensitive and Quantitative Detection of Microbial Populations in Complex Communities

The rapid development of metagenomic technologies, including microarrays, over the past decade has greatly expanded our understanding of complex microbial systems. However, because of the ever-expanding number of novel microbial sequences discovered each year, developing a microarray that is representative of real microbial communities, is specific and sensitive, and provides quantitative information remains a challenge. The newly developed GeoChip 5.0 is the most comprehensive microarray available to date for examining the functional capabilities of microbial communities important to biogeochemistry, ecology, environmental sciences, and human health. The GeoChip 5 is highly specific, sensitive, and quantitative based on both computational and experimental assays. Use of the array on a contaminated groundwater sample provided novel insights on the impacts of environmental contaminants on groundwater microbial communities.

assigned to the corresponding legacy probe and excluded from further probe design steps. If a legacy probe showed potential cross hybridization to a non-target sequence, it was voided and removed from the probe collection and the corresponding target sequences were released and reused for probe design.
Probe design was performed using a new version of the CommOligo software (3). Two types of probes were designed: gene-specific (each probe targets one gene sequence) and group-specific (one probe targets two or more highly homologous sequences) (1). The newly designed candidate probes and legacy probes were searched against the NCBI nt/env_nt databases to verify specificity based on experimentally determined sequence similarity (≤ 90%), continuous stretch (≤ 20 bases) and free energy (≥ -35 kcal/mol) criteria (1). Potentially non-specific probes were removed from further consideration. Because multiple probes were designed for the same sequence or group of sequences, CommOligo was used to rank the remaining probes (3), and only top-ranked probes were used for array construction.
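The specificity screen described above can be sketched as a simple filter. This is a minimal illustration, not the CommOligo implementation: the `Hit` record, its field names, and the assumption that a probe passes only when every non-target database hit satisfies all three thresholds at once are hypothetical. The free energy threshold is used as the bare number quoted in the text.

```python
from dataclasses import dataclass

# Thresholds quoted in the text; constant names are illustrative.
MAX_IDENTITY_PCT = 90.0   # sequence similarity must be <= 90%
MAX_STRETCH = 20          # longest continuous matching stretch must be <= 20 bases
MIN_FREE_ENERGY = -35.0   # free energy must be >= -35 (units as given in the text)


@dataclass
class Hit:
    """One probe-vs-non-target comparison (hypothetical record layout)."""
    identity_pct: float
    longest_stretch: int
    free_energy: float


def hit_is_safe(hit: Hit) -> bool:
    """True if this non-target hit meets all three specificity criteria."""
    return (
        hit.identity_pct <= MAX_IDENTITY_PCT
        and hit.longest_stretch <= MAX_STRETCH
        and hit.free_energy >= MIN_FREE_ENERGY
    )


def probe_is_specific(nontarget_hits: list[Hit]) -> bool:
    """Assumed rule: a probe is kept only if no non-target hit violates a criterion."""
    return all(hit_is_safe(h) for h in nontarget_hits)
```

A probe with no non-target hits passes trivially under this rule; ranking of the surviving probes is a separate step that is not modeled here.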
Microarray construction. Two major formats of the GeoChip 5.0 array were developed. The smaller format (GeoChip 5.0S) has ~60,000 probes per array (see Table S1 for details). For delineating experimental parameters, several modified versions of GeoChip 5.0S were constructed that included perfect match (PM) and mismatch (MM) probes from different pure cultures. The larger format (GeoChip 5.0M) has ~180,000 probes per array (Table S1). All GeoChip 5.0 microarrays were manufactured by Agilent (Santa Clara, CA, USA) using either the 8x60K (8 arrays per slide) or the 4x180K (4 arrays per slide) format. Each GeoChip array was evenly divided into 96 (8×12 grids) subarrays for 5.0S, and 256 (8×32 grids) subarrays for 5.0M. Each subarray has sixteen 16S control probes and five common oligonucleotide reference standard probes (CORS) (4) at specific positions. The 16S control probes were split into two groups of 8 and were placed at the top and bottom of each subarray. CORS probes were placed in the central region of each subarray. Additionally, each subarray had 2 or 3 randomly placed Agilent negative control probes. The hyperthermophile control probes and functional gene probes were randomly placed across the array in the remaining available spaces.
ratio [SNR; (probe signal-background)/background SD] was calculated. As suggested by Agilent, the average signal of Agilent's negative control probes within each subarray was used as the background signal for the probes in that subarray instead of the local background typically used.
If all negative control probes within a given subarray failed to yield a valid signal, the mean background signal intensity from an adjacent subarray was used instead. The signal intensity for each spot was corrected by subtracting the background signal intensity. If the net difference was <0, the spot was excluded from subsequent analysis.
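The background handling described above can be sketched as follows. This is a minimal illustration under stated assumptions: invalid negative-control readings are represented as `None`, the function and argument names are hypothetical, and the background SD passed to `snr` is supplied by the caller because the text does not say how it was estimated.

```python
from statistics import mean


def subarray_background(neg_controls, neighbor_neg_controls):
    """Mean negative-control signal for a subarray.

    Falls back to an adjacent subarray's controls when none of this
    subarray's controls gave a valid signal (None marks an invalid reading).
    Raises statistics.StatisticsError if the neighbor has no valid signal either.
    """
    valid = [s for s in neg_controls if s is not None]
    if not valid:
        valid = [s for s in neighbor_neg_controls if s is not None]
    return mean(valid)


def background_correct(signals, background):
    """Subtract the background and drop spots whose net signal is below zero."""
    corrected = {}
    for probe_id, signal in signals.items():
        net = signal - background
        if net >= 0:  # the text excludes only spots with net difference < 0
            corrected[probe_id] = net
    return corrected


def snr(signal, background, background_sd):
    """Signal-to-noise ratio: (probe signal - background) / background SD."""
    return (signal - background) / background_sd
```

For example, a subarray whose own controls all failed and whose neighbor reads 10.0 and 14.0 gets a background of 12.0, so a spot reading 20.0 is kept with a net signal of 8.0 and a spot reading 5.0 is dropped.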
A two-step data normalization and quality filtering method was performed for all arrays in an experiment (4,11). First, the average signal intensity of CORS was calculated for each subarray, and the maximum average value among all subarrays was applied to normalize the signal intensity of samples in each array. Second, the sum of the signal intensity was calculated for each array, and the maximum sum value was applied to normalize the signal intensity of all spots in each array, which produced a normalized value for each spot in each array. Detailed descriptions of the optimized GeoChip sample preparation, hybridization, imaging and normalization methods, and of the reagents and equipment needed, are given in (12).
Statistical analysis. Various statistical methods were used for analyzing the GeoChip data.
Three different nonparametric multivariate analysis methods, ADONIS (permutational multivariate analysis of variance using distance matrices), ANOSIM (analysis of similarities) and MRPP (multi-response permutation procedure), and detrended correspondence analysis (DCA), were used to measure the overall differences of community functional gene structure (13). The microbial community functional gene diversity was estimated using the Shannon index, the Simpson index and functional gene richness. The Pearson correlation coefficient was used for testing the dependence among environmental factors. A dendrogram of environmental factors was constructed based on Manhattan distances using Ward's minimum variance method (14), and was cut into hierarchical clusters using the cutree method (15). Canonical correspondence analysis (CCA) and partial CCA were used for analyzing statistical linkages between the functional gene structure and environmental variables. Variation partitioning analysis (VPA) was used to divide and assign the variance in microbial functional gene structure among samples. A forward selection procedure was used to build a stepwise CCA model (16). Briefly, at each step the model is extended by adding the variable that maximizes the explanatory power of the model. The procedure is automatically terminated when (i) the explanatory power of the model starts to decrease, (ii) the permutation test fails after a variable is added or (iii) all variables are included. Welch's t-test, which does not assume equal variances, was used to test the significance of differences in functional gene richness and alpha diversity between paired groups of samples. All statistical analyses were performed in R (version 3.4.4, 2018-03-15) using the packages stats, ape and vegan.
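The three alpha-diversity measures named above can be illustrated with a short sketch. The analyses in this work were run in R; the Python below is only an illustration, and it assumes that normalized probe intensities stand in for abundances, that the Shannon index uses the natural logarithm, and that the Simpson index is reported as 1 - Σp², which is the convention of `diversity()` in vegan. Function names are hypothetical.

```python
import math


def richness(intensities):
    """Functional gene richness: number of probes with a positive signal."""
    return sum(1 for x in intensities if x > 0)


def shannon(intensities):
    """Shannon index H' = -sum(p_i * ln p_i) over probes with positive signal."""
    total = sum(intensities)
    proportions = [x / total for x in intensities if x > 0]
    return -sum(p * math.log(p) for p in proportions)


def simpson(intensities):
    """Simpson index in the 1 - sum(p_i^2) form (assumed convention)."""
    total = sum(intensities)
    return 1.0 - sum((x / total) ** 2 for x in intensities if x > 0)
```

With four probes of equal intensity and one undetected probe, richness is 4, the Shannon index equals ln 4, and the Simpson index equals 0.75.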

SELECTION OF GENE FAMILIES FOR GEOCHIP 5.0 FABRICATION
All functional gene families from previous GeoChips (a total of 410) were updated. Detailed rationale for selecting these gene families and categories was previously provided (1,2,17-19).
All gene families from previous versions (all genes submitted prior to June 2009) were manually updated. The keyword queries, alignment cutoff values, and seed sequences were modified as necessary to increase sequence coverage and accuracy based on the NCBI databases available on July 29, 2013. During this update, some gene families were combined or separated based on newly discovered gene families or increased sequence availability. For example, twelve dioxygenase gene families from GeoChip 4 were combined into three gene families due to similarities in the sequences of these families; norB was split into two gene families to differentiate a new subgroup discovered after design of GeoChip 4. GeoChip 5.0 greatly expanded overall gene and sequence coverage by adding more than 1,000 new gene families and covered a total of 1,447 gene families involved in carbon (135 genes), nitrogen (28 genes), sulfur (27 genes), and phosphorus (7 genes) cycling, antibiotic resistance (19 genes), stress response (103 genes), microbial defense (65 genes), metabolic pathways (4 genes), plant growth viruses (115 genes), virulence (605 genes), metal homeostasis (119 genes), organic contaminant degradation (105 genes), pigments (30 genes) and electron transfer (11 genes) (Table 1). Detailed descriptions for all selected functional gene families are provided below and in Table S1.
Several of the updated and expanded categories have been discussed in detail in other publications, including virulence genes (20) and stress response (21,22) genes. In addition, microbial defense, plant growth promotion, pigments and protist phylogenetic markers were added to increase coverage of functional processes of importance to various ecological and environmental processes.
Here we provide an overview of these gene families.

Categories for geochemical cycling
Microorganisms play key roles in geochemical cycling, including the carbon, nitrogen, phosphorus, and sulfur cycles. Genes for these cycles were included on GeoChip 5. Many of these genes were also on versions 3 and 4 but have been greatly expanded and refined in this updated version.

Carbon cycling
Genes in the carbon cycling category include those for carbon degradation, carbon fixation, and methane cycling (methane oxidation and methanogenesis).
Microorganisms are capable of degrading a wide variety of carbon sources. To cover some of that diversity, the number of carbon degradation genes covered on GeoChip 5.0 was greatly expanded.
GeoChip versions 2-4 included genes for degradation of cellulose, lignin, chitin, starch, hemicellulose, and pectin (1,11,23). In addition to those genes, genes for degradation of terpenes and genes for the glyoxylate cycle were included. Chitin is a polysaccharide present in many organisms including fungi, crustaceans, and insects.
The two original genes (acetylglucosaminidase and chitinase) involved in chitin degradation were updated and chitinase was split into 3 sub-categories based on where the enzyme attacks (endo-, exo-, or undefined). Chitinases are found in any number of organisms including viruses (46).
Inulinase is currently the only gene on GeoChip for degradation of the polysaccharide inulin (53-55).
The glyoxylate cycle is involved in the conversion of acetyl-CoA to succinate for use in a number of biosynthetic pathways and may also play an important role in bacterial and fungal virulence. The key enzymes are isocitrate lyase (AceA) and malate synthase (AceB) (68,69).

CO2 fixation.
Carbon fixation converts inorganic carbon (CO2) into organic carbon that can be used by other organisms. Autotrophic CO2 fixation is "the most important biosynthetic process in nature" (70).
There are now six known pathways for autotrophic CO2 fixation (71). When earlier versions of GeoChip were designed, only five pathways were known (72), and a single enzyme from each of four of these pathways was covered. Here we sought to increase coverage by including additional enzymes from each cycle (total number of genes indicated in parentheses): Calvin cycle (9 genes) (73, 74), 3-hydroxypropionate bicycle (10 genes) (75), reductive acetyl-CoA pathway (2 genes) (76), reductive tricarboxylic acid cycle (8 genes) (77), and two new pathways: dicarboxylate/4-hydroxybutyrate cycle (12 genes) (78) and 3-hydroxypropionate/4-hydroxybutyrate cycle (12 genes) (79). Only the reductive acetyl-CoA pathway did not have genes added, as many of the enzymes in this particular cycle have dual functions in other normal cellular processes.
In addition to enzymes, genes for carboxysomes, bacterial microcompartments that aid in the concentration of CO2 (80-82), were also selected. The seven carboxysome genes cover shell proteins, which act as a CO2 diffusion barrier, concentration mechanisms and carbonic anhydrase isoforms. One specific protein for each of the alpha and beta carboxysomes was also included (83).

Methane metabolism:
Methane accounts for about 10% of all greenhouse gas emissions and is primarily produced during decomposition of organic matter (84). Methanogenesis is the process by which single-carbon substrates are reduced to produce methane and generate energy. This methane can in turn be oxidized to create CO2 and H2O. There are four methanogenesis pathways: the core, acetoclastic, hydrogenotrophic, and methyl-corrinoid pathways. The core pathway is the final reduction step to produce methane via methyl-coenzyme M. The hydrogenotrophic pathway reduces CO2 to CO and then formate, which then feeds into the core pathway. The acetoclastic pathway reduces acetate to acetyl CoA. The acetyl CoA is then reduced in the final step of the hydrogenotrophic pathway.
The methyl-corrinoid pathway reduces substrates containing methyl groups and then feeds into the core pathway. Previously, only methyl-coenzyme M reductase (mcrA) was covered as an indicator for methanogenesis. Additional genes were added to cover the acetoclastic (3 genes), hydrogenotrophic (6 genes), and methyl-corrinoid (3 genes) pathways (85). Particulate and soluble methane monooxygenases, pmoA and mmoX, are included for methane oxidation.

Nitrogen Cycling
While N is critical to all living things, the largest N pool, N2, is extremely stable and requires a great deal of energy to reduce it to a state that is biologically available. In addition, most of the nitrogen in soil is biologically unavailable. While some nitrogen is available from minerals in the soil, nitrogen availability is largely controlled by microbial activity. Nitrogen-fixing bacteria convert atmospheric N2 into NH4, allowing it to be used by plants and other bacteria. Plant and animal decay releases NH3 via ammonification. The NH3/NH4 can be converted to NO2 and then NO3 via nitrification; NO3 is converted back to N2 via denitrification. NO3 can also be reduced to NH4 via N reduction. NH4 and NO2 can be converted to N2 via anaerobic ammonium oxidation (anammox). Each of these processes is covered on the GeoChip: N fixation (1 gene), nitrification (4 genes), denitrification (6 genes), ammonification (4 genes), dissimilatory (2 genes) and assimilatory (5 genes) N reduction to NH4, and anammox (2 genes) (2, 86).
Two types of nitric oxide reductase genes were included: cnorB, a cytochrome bc complex type enzyme, and qnorB, a quinol-oxidizing single-subunit class (89). An assimilatory nitrate reductase (narB) from cyanobacteria was added as well. This gene is similar to nasA, but has a different nomenclature (90-95).
Additionally, 3 new genes for nitrogen assimilation by bacteria and fungi, including ammonium, nitrate and nitrite transporters were added.

Phosphorus cycling
Phosphorus plays an important role in biological life as a constituent of cellular components such as nucleotides, ATP and membranes. Excess phosphate (Pi) is frequently stored by all living organisms as polyphosphate chains that can contain hundreds of Pi residues. Polyphosphate has numerous functions, including Pi storage, stress response, virulence, and maintenance of stationary phase (96). Two genes for polyphosphate biosynthesis (ppk and phytase) and one for polyphosphate degradation (exopolyphosphatase/ppx) were included on earlier versions of GeoChip. Polyphosphate kinase (ppk) removes a Pi from ATP to lengthen poly P. Phytase is involved in the conversion of organic phosphorus to inorganic phosphorus by hydrolyzing phytate.
Additional genes for oxidation of inorganic phosphate compounds found in the environment were added including hypophosphite dioxygenase (htxA), which oxidizes hypophosphite (97) and phosphonate dehydrogenase (ptxD), which oxidizes phosphonate (98,99). Two new genes for polyphosphate degradation were also added. These include a second form of polyphosphate kinase (ppk2) that creates either GTP from GDP or ADP from AMP (100, 101) and an endopolyphosphatase (ppn) that cleaves poly-P into various sized units.

Sulfur cycling
Sulfur is the ancient "motor of life" and played a similar role on ancient Earth as O2 plays now (102). In the current primarily aerobic environment, sulfur acts as both an electron acceptor and donor for anaerobic respiration and can be oxidized (102). The sulfate reduction/sulfide oxidation cycle is found in a variety of environments, such as freshwater and marine sediment or microbial mats, and the O2-H2S interface formed in this cycling often moves based on factors such as tides or amount of sunlight present (102,103). Previous versions of GeoChip included dsrA and dsrB for sulfite reduction, sox for sulfur oxidation, and three genes for adenylylsulfate reductase (2 for aprA and aprB). Additional genes involved in sulfur transformation reactions were included on this newest version. Dimethylsulfoniopropionate (DMSP), a major source of C and S in marine environments, is degraded either by cleavage by DMSP lyase or by demethylation by DMSP demethylase (dmdA) (104). Cleavage of DMSP produces dimethylsulfide (DMS), a volatile compound which, when transported to the atmosphere and oxidized, can modulate the formation of clouds (105,106).
Demethylation of DMSP ultimately leads to release of additional C that can be further utilized by marine bacteria (104). The sox "gene" of previous GeoChips was split into five "active" component subunits: soxA, soxB, soxC, soxV, and soxY (107). CysI and J encode a sulfite reductase in the cysteine biosynthesis pathway (108).

Categories related to microbial response to environmental conditions
Metal Homeostasis: High concentrations of metals can be toxic to microorganisms. This type of contamination is common due to both anthropogenic and natural causes (109). To limit exposure and protect against damage from these metals, microorganisms have developed resistance mechanisms (110,111). Previous versions of GeoChip covered 44 genes/enzymes for resistance to 13 commonly detected metals with well-studied resistance mechanisms (Ag, Al, As, Cd, Co, Cr, Cu, Hg, Ni, Pb, Se, Te and Zn) (110-112) and one gene for bacterial metallothioneins and metallothionein-like proteins (smtA).
The metal resistance category has been expanded to include additional resistance, metal uptake and maintenance mechanisms (72 additional genes) and has been renamed metal homeostasis. Metal acquisition genes included those for ion pumps for several metals including boron (2 genes), calcium (2 genes), magnesium (3 genes), manganese (2 genes), potassium (6 genes), and sodium (8 genes). A number of genes related to iron scavenging, such as transporters, siderophores or siderophore receptors, storage, and oxidation (11 genes) were also included. Metal resistance gene families were updated to include transport and enzymatic transformation genes for resistance to arsenic (6 genes), copper (10 genes), mercury (9 genes), and chromium (2 genes). Nutrient metals can also be toxic at higher concentration, so uptake and efflux transporters were included for metals such as nickel (7 genes), cobalt (5 genes), copper (10 genes) and zinc (14 genes). A majority of the genes in this category are transporters, the most common metal resistance mechanism for bacteria (112). In addition, transporters are involved in uptake of nutrient metals. Other mechanisms of metal resistance are enzymatic modification of the toxic metal or sequestration, so that the metal is no longer biologically available. New genes include arsenic related genes arrA (respiratory arsenate reductase) (113) and arxA (anaerobic arsenite oxidase) (114), boron related genes bor1 (boron transporter) (115), atr1 (boron exporter from fungi) (116), calcium (chaA, calcium/proton antiporter) (117), cobalt/magnesium (corA) (118), and cobalt/nickel (nreB, MFS family protein) (119).

Organic Contaminant Degradation
Several major changes were made to this group of genes from earlier GeoChip versions. First, a number of old genes were removed as they no longer give useful information for various reasons such as crossover with other non-target genes that could not be resolved by HMMER or being so far down a degradation pathway(s) that they were involved in reactions that could be considered general cell metabolism. Second, several target genes were combined due to sequence similarity, which meant they could not be fully separated. Several genes were split due to enantiomer selectivity and other genes that were composed of 2 or more subunits were reduced to a single subunit. Several new genes were also added.
Genes will be listed below by chemical with the reason for that chemical being chosen. Several of the genes can be utilized in multiple degradation pathways as they are further "downstream" but will only be listed here one time.
Acetylene is a basic building block for a number of chemical processes and is degraded by Xamo (alkene monooxygenase) (179).
Saturated hydrocarbons (alkanes) are one of the main components of crude oil. Degradation of these compounds is important in bioremediation and in the ecology at natural oil seeps. Alkanes are degraded by alkylsuccinate synthase (AssA) (182).
Aniline is used in the manufacture of many products but mainly for polyurethane and is degraded by tdnQ (aniline dioxygenase) (183).
Benzaldehyde is a downstream product of a number of xenobiotic degradation pathways and is degraded by xylC (4-hydroxybenzaldehyde dehydrogenase) (189).
Benzonitrile is a common solvent and intermediate in many industrial chemical processes and is degraded by nitrilase (196) and amiE (aliphatic amidase) (197).
Carbazole is used in minor amounts for the production of dyes and is produced during incomplete combustion. It is degraded by carA (carbazole 1,9a-dioxygenase) (202).
Chloromethane is produced in minor amounts by phytoplankton. It was once used as a refrigerant but has been discontinued. It is degraded by cmuA (chloromethane methyltransferase) (209).
Cyanuric acid is used as part of, or in the manufacture of, bleaches, disinfectants or herbicides. It is also involved in atrazine degradation. It is degraded by atzE (biuret hydrolase) (210).
Dichloromethane is a common solvent and used to "glue" some plastics together. It is degraded by
Dimethyl sulfoxide is a solvent that can mix with many different organic solvents and with water and is degraded by dmsA (dimethyl sulfoxide reductase) (228).
Gallate is a natural compound found in a number of plants and is used to manufacture pharmaceuticals. Gallic acid was used in the manufacture of inks. It is degraded by athL (pyrogallol hydroxyltransferase) (237).
Glyphosate, also known as the herbicide "RoundUp", has been in use since the 1970s. It is one of the most highly used herbicides in the world. It is degraded by Phn (carbon-phosphorus lyase) and mauAB (methylamine dehydrogenase) (238,239).
Isopropylbenzene (cumene) is commonly found in crude oil and used as a base for the production of other chemical compounds. It is degraded by cumB (dihydroxyisopropylbenzene
Hydroxyacetophenone is an intermediate in the breakdown of other compounds such as bisphenol A and is degraded by arylest (arylesterase) (241).
Methanesulfonic acid is used as an acid catalyst in a variety of organic solvents and is degraded by MSAMO (methanesulfonic acid monooxygenase) (243).
Methylquinoline is used in dye production and is degraded by qorL (quinoline 2-oxidoreductase),
MTBE is a widely used fuel oxygenate. It has caused widespread groundwater contamination and is degraded by alkB (alkane 1-monooxygenase) (245).
Nicotine is a natural alkaloid in several plants, especially tobacco. It is a stimulant in animals and was used as an insecticide. Nicotine is degraded by ndhC (nicotine dehydrogenase), 6HDNO ((S)-
Nitrobenzene is a basic building block used in a number of chemical reactions and is degraded by amnB (2-aminophenol 1,6-dioxygenase), nbzA (nitrobenzene nitroreductase), and nbzB (hydroxylaminobenzene mutase) (248).

Nitrobenzoate is an intermediate in the breakdown of a number of chemicals and is degraded by
Nitroglycerin is a common and widely used explosive. It is degraded by xenAB (nitroglycerin reductase) (250).
Octane is an alkane component of gasoline and is degraded by alkK (acyl-CoA synthetase) (255).
Organophosphates are a type of insecticide and are degraded by adpB (aryldialkylphosphatase) (256).
Pentaerythritol tetranitrate (PETN) is a high-power explosive and much more toxic than other explosives like RDX. It is considered a "munitions constituent of great concern" by the DOD. It is degraded by Onr (pentaerythritol tetranitrate reductase) (259).
Phenoxybenzoate is used in the synthesis of larger chemical compounds and is degraded by pobA (p-hydroxybenzoate hydroxylase) and POBMO (phenoxybenzoate monooxygenase) (261,262).
Phenylacetaldoxime is used as a base chemical for pesticides and has potential applications in cancer treatment. It is degraded by oxdB (phenylacetaldoxime dehydratase) (263).
Phenylpropionate is naturally produced during breakdown of plant material and is also a component of synthetic steroids. It is degraded by hcaE (3-phenylpropionate dioxygenase), hcaB
Pyrene is a natural compound found in coal tar and in combustion products, including those produced by the burning of gasoline. It is used in the production of dyes. It is degraded by nidA (pyrene dioxygenase) (270).
Reductive dehalogenase (Rd) is involved in the removal of halogen atoms from parent compounds such as PCE and TCE (271,272).
Tetrahydrofuran is a solvent and used in the manufacture of a variety of polymers and is degraded by thmA (tetrahydrofuran hydroxylase) (276).
Thiocyanate is used in the production of other chemicals and is degraded by scnC (thiocyanate hydrolase) (277).
Toluate is a base compound used in the manufacture of various plastics and is degraded by xylX   (18).
Selected sigma factors that are involved in transcription initiation for stress response genes include the housekeeping sigma factor σ70, σ38 for general stress response and σ32 and σ24 for heat shock (297). The haem-catalase katE was also included for general stress response (298), as was obgE, a stringent response GTPase that maintains low intracellular concentrations of ppGpp (299-302). The ppGpp acts as a transcriptional regulator during periods of stress (303).
Heat and cold shock proteins were included because microorganisms are often exposed to temperature variations in the environment. Heat shock proteins include dnaK, grpE, groES, and groEL, molecular chaperones that prevent or correct denaturation (304), and the regulatory gene hrcA (305). Microbes handle cold shock by increasing the ratio of unsaturated to saturated fatty acids in membrane lipids. This is accomplished via the desaturase gene, des, the expression of which is controlled by the two-component system genes, desK-desR (306). In addition, there are also cold shock induced chaperone proteins, cspA and cspB (307).
Osmotic shock occurs when the cell encounters a sudden change in solute concentration in its surrounding environment, which can lead to a rapid increase or decrease of water in the cell. To protect themselves, microbial cells can modify the concentration of osmoprotectants within the cell using transport systems such as opuE, a sodium/proline symporter or the ProU transport system comprised of proV, proW, and proX (308). The ProU system has a broad substrate specificity, but preferentially transports glycine betaine and proline betaine (309,310).
An increase in reactive oxygen species can trigger oxidative stress. This stress response is regulated by perR and oxyR and includes induction of ahpCF, an alkyl hydroperoxide reductase, and katA, a catalase, to detoxify reactive oxygen species (311-313).
In environments where there is insufficient oxygen, cytochrome genes (cydA and cydB) are activated via regulatory genes such as fnr and arcA and arcB, a two-component system (314,315).
In addition, some microorganisms contain genes that allow them to use other electron acceptors, such as nitrate, via nitrate reductase genes (narG, narH, narJ, and narI) (316).
Another stressor commonly encountered by microorganisms is nutrient limitation. Three common nutrients that are often limiting are carbon, phosphate and nitrogen. The genes bglP (aryl-beta-glucoside-specific enzyme II) and bglH (phospho-beta-glucosidase) allow for the use of aryl-β-glucosides as an alternate carbon source (317).
External N limitation is sensed by glutamine in enteric bacteria (321). Genes for glutamine synthetase (glnA) and the regulatory genes tnrA and glnR were included (322).
Protein stress is triggered by over production of recombinant proteins in microbial cells (323,324), which induces the activation of heat shock sigma factor σ32 and σ32-dependent genes (325). We selected clpC (ATPase subunit in the Clp machinery) and regulator gene ctsR to target protein stress (326,327).
New genes added include antioxidant enzymes such as catalase, peroxidase, and superoxide dismutase, which protect organisms from abiotic and biotically produced oxygen radicals; envelope stress genes which are involved in modifying and repairing the cellular envelope when under stressful conditions (328), and pH stress response genes.

Plant growth promotion:
Plant-microbe interactions are an important aspect of plant growth and health. Bacteria and fungi produce a number of compounds that alter host plant metabolism and growth and increase stress tolerance and resistance to pathogens (329). Genes covered in this category include those for plant-like hormones (9 genes) such as gibberellin, ethylene, auxin and polyamines (spermidine synthase), which are involved in plant growth (330-332), and trehalose synthase genes, whose product acts as a protecting agent to maintain structural integrity of the cytoplasm under environmental stress, such as drought conditions (333). In addition, genes involved in pathogen suppression were included. Siderophores from these beneficial bacteria compete with pathogens for available iron (329 and references therein), so genes related to siderophore production were included.
Antioxidants (superoxide dismutase, peroxidase) scavenge reactive oxygen species generated by plants in response to drought, nutrient and other stresses (334,335).
Microbial defense: Microbial defense mechanisms can indicate the presence of predators or competing microbes.
Antibiotic resistance. Microorganisms are frequently exposed to antibiotics both from natural sources (e.g., other bacteria in the surrounding environment) and from man-made sources (e.g., wastewater treatment plants). As such, microbes have developed mechanisms to prevent damage from the antibiotics. These mechanisms can be intrinsic (functional or structural features that prevent the antibiotic from acting against the cell) or acquired (those resistance mechanisms derived from genetic elements that can be passed to other bacteria) (336). Primary mechanisms of resistance include prevention of entry, efflux, modification/absence of the antibiotic target, or inactivation of the antibiotic itself (336). Intrinsic features such as cell wall structures that minimize antibiotic entry or modified/absent targets generally do not require a specific gene to be present, so they are difficult to test for with microarrays. Consequently, most of the antibiotic resistance genes covered on the GeoChip are for efflux transporters (8 genes; e.g., ATP-binding cassette (ABC), multi-antimicrobial extrusion protein (MATE), major facilitator superfamily (MFS), resistance-nodulation-division (RND), small multidrug resistance (SMR) transporters) or enzymes responsible for antibiotic degradation (9 genes). Several genes from previous GeoChip versions were split into multiple genes due to the number of sequences. These splits were done along phylogenetic lines. Two intrinsic resistance mechanisms were also included: the genes qnr, which expresses a protein that binds to and protects DNA gyrase and topoisomerase IV from attack by ciprofloxacin (337)
Fosfomycin is a broad spectrum antibacterial used alone or in conjunction with other antibiotics.
Quinolones are broad-spectrum synthetic antibiotics used against both Gram-positive and Gram-negative infections in human medicine and agriculture. There are several different mechanisms of resistance. Qnr is a plasmid-borne mechanism that protects bacterial topoisomerases (440).
Tetracycline-type antibiotics are natural and synthetic polyketide antibiotics with a broad spectrum of activity that act by inhibiting protein synthesis. In addition to the MFS transporters mentioned above, there are also enzymatic resistance mechanisms. TetX is a monooxygenase that provides resistance to tetracycline antibiotics, including recently approved ones such as Tygacil (441). TetM and related genes provide resistance to tetracycline antibiotics through protection of the ribosome (442). Other genes include tetO (443), tetQ (444), tetW (445), tetS (446), and otrA (447).

Antimicrobial biosynthesis. Microorganisms produce a number of compounds that inhibit the growth of or kill other organisms (448). These include "classic" antibiotics such as chloramphenicol (para-aminobenzoate synthase, glutamine amidotransferase, component II) and beta-lactams (isopenicillin N synthase), as well as other compounds such as phenazines (phzB) (449) and pyrrolnitrin (prnD) (450). Also included are vanadium haloperoxidase, which is involved in the production of various halogenated compounds in algae (451), and hydrogen cyanide synthase, which is involved in the production of the antimicrobial hydrogen cyanide (452).

CRISPR. Bacteria and archaea are under constant pressure from viruses and other mobile parasitic genetic elements. CRISPR-Cas systems are adaptive immune systems that defend against these elements through a multistep process in which the invader is recognized and short pieces of its genome are incorporated between short DNA repeats and used to recognize subsequent infections (453). This "immune system" most likely also plays an important role in the environment in relation to predation by viruses and incorporation of exogenous DNA. The CRISPR locus itself is made up of viral or plasmid sequence snippets separated by short repeat sequences. It is not a "functional gene" in and of itself: the repeat sequences are too short for our current probe design pipeline, and the interspersed viral/plasmid sequences change constantly as the organisms are exposed to new sequences. CRISPR-associated (Cas) genes, which are suitable for probe design in our pipeline, were therefore chosen for this section.
Previous research has also shown that by knowing which of the Cas proteins are present it is possible to define the type, and even subtype, of the CRISPR system(s) present in an organism.

The Cas proteins selected (49 genes) cover various types and subtypes of CRISPR-Cas systems, such as cas and cmr (454,455).
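
The idea that the set of Cas genes present indicates the CRISPR system type can be illustrated with a minimal sketch. The signature-gene table below (cas3 for type I, cas9 for type II, cas10 for type III) reflects the commonly used CRISPR-Cas classification and is an assumption for illustration, not the gene list or logic used for the array; the function name and input format are hypothetical.

```python
from typing import Iterable, Set

# Assumed signature genes from the general CRISPR-Cas classification scheme;
# this table is illustrative and is not taken from the array design.
SIGNATURE_GENES = {"cas3": "I", "cas9": "II", "cas10": "III"}


def infer_crispr_types(detected_genes: Iterable[str]) -> Set[str]:
    """Return the CRISPR-Cas types suggested by the detected Cas genes."""
    # Normalize names so that "Cas9" and "cas9" are treated the same.
    detected = {gene.strip().lower() for gene in detected_genes}
    return {
        crispr_type
        for gene, crispr_type in SIGNATURE_GENES.items()
        if gene in detected
    }


if __name__ == "__main__":
    print(infer_crispr_types(["cas1", "cas2", "Cas9"]))
```

Core genes such as cas1 and cas2 occur in most systems and therefore do not discriminate between types in this sketch; subtype assignment would need additional subtype-specific genes.
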
Environmental toxins. A small portion of marine algae produce toxins that can negatively impact humans and animals (456). Under favorable conditions, harmful algal blooms (HABs) can occur, resulting in poisoning through ingestion of contaminated food or water, skin contact, or inhalation of the toxins. The number of HABs occurring annually has been increasing over the past few decades, and the number of areas affected by the blooms has increased, likely due to anthropogenic activities such as eutrophication, transport of harmful species via ballast water, warming-related weather events, and the increasing temperature and CO2 associated with global climate change (456).
Saxitoxin is a neurotoxin produced by dinoflagellates (marine waters) and cyanobacteria (freshwater) and is the etiological agent of paralytic shellfish poisoning, which results from consuming contaminated shellfish (457). Microcystins are hepatotoxins produced by cyanobacteria (458).

Virulence. Pathogens possess a number of virulence factors that directly or indirectly assist them in infecting and surviving within their hosts. Genes within this category include those for surface attachment and those that aid in the avoidance of the host's immune response, such as capsules (459).
Two bacterial structures that are also involved in adherence are pili and fimbriae. Pili play roles in motility, surface attachment, and conjugation in bacteria (462,463). The major protein subunit of pili, pilin (464), was chosen to represent pili. Sequences within this gene family include type
Sortases are a family of proteases and transpeptidases found in Gram-positive bacteria. They are needed for anchoring surface proteins to the cell wall and for adhesion to and colonization of host cells and tissues (467,468). Sortase sequences are for srtABCDF.
Other virulence proteins included sequences from CrfA, SrfB, EsaA, EssB, pGP2-D and IpgD, surface-exposed virulence protein BigA, iron-regulated outer membrane virulence protein IrgA, adherence and virulence protein A and virulence proteins S and Q.

Virus
Bacteriophages are an important part of the microbial community, yet how this community changes in relation to environmental factors has not been studied in depth. In the environment, these viruses are important to the turnover of nutrients through lysis of their hosts, to the exchange of genetic information between hosts, and to genetic drift by severely depleting or killing off particular strains of a host organism within a local area. Viruses of photosynthetic eukaryotic microorganisms can be important in both environmental and industrial settings. In the environment, they are involved in the turnover of nutrients and in population control, especially during blooms of toxin-producing organisms.
Gene selection for this group involved identifying genes necessary for different points in the bacteriophage "life cycle": replication, infection (host identification, genome injection), structural components, and lysis of the host organism, as well as genes that identify specific viral groups (genus or family). Proteins related to viral infection (tail fibers), replication (polymerases), and escape/lysis of the host cell (holins), as well as virion structural components (capsid/coat proteins), were selected to cover bacteriophages (prokaryotic hosts) and viral genera or families that infect fungi (mycoviruses) and protists or that contain members known to be soil transmitted (e.g., Tobravirus) or water transmitted (e.g., Adenoviridae).
Genes covered in this section include transmission proteins, which aid in the dispersal of the virus by its vector. These include the Tobravirus transmission protein (471).
Bacillariodnavirus is a relatively recent addition to protist viruses.
Other functional genes were submitted for specific virus groups (genera or species) rather than solely by function. A description of these follows.
Adenovirus (Adenoviridae) (475) genes include Adenoviridae_fiber for capsid fibers, which play an important role in the recognition and binding of the target receptor on the host cell (476).
Adenoviridae_hexon is the major capsid protein of the virus coat and a regular target used in PCR detection of this virus type (477)(478)(479)(480)(481)(482). Adenoviridae_protease is used for polyprotein processing and is another common PCR target for this virus family (478).
Astroviridae (483) are covered by Astroviridae_capsid, the major capsid protein that is one of the two main targets used for virus detection (484) and Astroviridae_RdRp, the RNA dependent RNA polymerase, the main target for detection of Astroviruses (485).
Hepeviridae (486) are covered by Hepeviridae_capsid, the major capsid protein, which is a regular target for PCR detection of this viral group (487,488), and Hepeviridae_pORF1, which contains several nonstructural proteins including RdRp; part of this ORF has been used as a marker for viral detection (489).
Caliciviridae (490) are covered by RdRp_Caliciviridae and VP1capsid_Caliciviridae (491).
Reoviridae are covered by VP7_Gserotype_Rotavirus, an outer capsid antigen that is commonly used for serotyping (496,497); VP6_Rotavirus, which encodes a protein used to define subtypes within the VP7-VP4 types (498); VP4_Pserotype_Rotavirus, an outer capsid antigen that is commonly used for serotyping; and Enterotoxin_Rotavirus, an enterotoxin linked to the cellular cascade that triggers diarrhea (499).
Coronaviridae (505) are covered by Coronaviridae_M_protein, which plays an important role in virus assembly (506), and Coronaviridae_spike, a glycoprotein that helps determine host specificity and aids in entry into the host cell (507).

Protozoan
Protists are key members of environmental food webs, linking different trophic levels through detritivory and predation of lower levels and serving as food sources for higher levels.
They also make significant contributions to primary production. Photosynthetic protists are among the primary aquatic species responsible for primary production and play important roles in the biogeochemical cycling of carbon (C), nitrogen (N) and phosphorus (P) (508,509). Several genes were selected as phylogenetic markers for various non-fungal protozoan groups. These included actin, cytochrome oxidase subunit 1, glyceraldehyde 3-phosphate dehydrogenase, heat shock protein 70, heat shock protein 90, elongation factor 1 alpha, polyubiquitin, and tubulin, based on a review of literature (510). Other genes such as trichocyst matrix protein were selected since some protists possess exocytotic organelles that are believed to perform defensive functions (511)(512)(513).
Movement proteins such as the paraflagellar rod, a feature of kinetoplastid protozoa that is necessary for their movement, were also included. These proteins may also play other roles in pathogenesis (514). Attempts were made to cover as many members of the non-fungal protists as possible with genes that have previously been used for phylogenetic purposes by other researchers.
Multiple genes were submitted when possible to ensure optimal coverage of the protistal groups.
Oomycetes are plant pathogens that produce a wide variety of avirulence and effector proteins that aid in pathogenesis (515)(516)(517)(518). Protease and glucanase inhibitors are also believed to aid in maintaining infection (519,520). The necrosis-inducing protein, involved in host cell death, is also important in the infection process (521). Oomycetes also produce a number of enzymes that help break down host cellular components, including pectinases, cutinases, and amylases (522,523). Cercozoa (524) are known to express trehalose synthase in infected plant tissues (525).
Functional genes covered for protists are listed below.
Heterotrophic protists need a variety of carbon degradation enzymes for the breakdown of macromolecules. However, little work in this area has been done for most protistal groups, the exceptions being the gut symbionts of termites and some ciliates. Covered genes include cellulases (526) and xylanase (527).
Silicon is an important element to a number of protists, including amoebas and diatoms and is used as the base element for the formation of protective shells or other structures (537)(538)(539)(540).
Silaffins are one of the important organic molecules associated with biosilica formation in diatoms, and the only one for which reliable sequences are known. It has been speculated that silaffins and the other biomolecules are involved in the deposition and patterning of the silica (541). Silicon transporters are needed for uptake of dissolved silicon (542). Genes for silicon biosynthesis (1 gene) used in the production of internal and external skeletons and a silicic acid transporter (1 gene) for internal enrichment of silicon (543) were included.
Photosynthesis is covered by chlorophyll, the major pigment involved in eukaryotic photosynthesis (544) and carotenoids, which act as accessory pigments in photosynthesis and as photoprotectants (545).
Energy processes are represented by carbamate kinase, which is involved in the energy metabolism for a few pathogenic protists, such as Giardia (546,547).
What little is known about metal cycling in protists has mostly been aimed at metal resistance in relation to contamination through industrial activities (548,549) and includes cadmium (550) and copper metallothionein (549). Other genes covered include the paraflagellar rod (551), which plays an important part in motility in certain protists, including some important pathogens (514).
Trichocysts are believed to be an important part of the defense mechanism that some protist groups, such as Paramecium, use to avoid predation (511). Vanadium bromoperoxidase is an essential enzyme for the production of halogenated metabolites. These metabolites can include antibiotics and other bioactive compounds (552).

Fungi
Fungi are important to the environment and to numerous human activities. In the environment, they contribute to the turnover of nutrients by degrading a number of large organic molecules, transport inorganic nutrients as mycorrhizal symbionts of most land plants, and act as pathogens. To humans, fungi are an important source of food and many other products, especially industrially useful enzymes. However, they can also cause a number of economically important diseases that affect humans, livestock, or agriculturally important crop plants.
The genes chosen for inclusion fell into several general categories: organic remediation, carbon degradation, metal resistance, antifungal resistance, virulence, and biogeochemical cycles of iron, sulfur, nitrogen, and phosphorus. The significance of these categories was described above.
Specific fungal genes include cyanide dehydratase (553), needed to detoxify cyanide produced by cyanogenic plants during successful infection; enniatin synthase (554,555), an important virulence gene; scytalone dehydratase (556), a disease determinant in Magnaporthe grisea, an important rice pathogen; and a potassium uptake protein, Trk_fungi (557). Chitin is a polysaccharide present in many organisms, including fungi. Chitin synthase was added because chitin is an important biomolecule in fungi (558,559).

Bacterial phylogeny
The gyrB gene was included as a phylogenetic marker, since it can be used for identification at the species/strain level (560). The more commonly used 16S rRNA gene evolves more slowly, making it difficult to discern closely related strains. The gyrB gene was divided into several sets based on phylogenetic groups: gyrB_Arch (archaea), gyrB_Actinobacteria, gyrB_Firmicutes, gyrB_G_proteobacteria, gyrB_Proteobacteria, and gyrB_Bact_other.
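
As a rough illustration of how a gyrB sequence might be routed to one of these sets from its taxonomic lineage, consider the sketch below. The set names come from the text; the matching rules (for example, that gyrB_G_proteobacteria holds Gammaproteobacteria and gyrB_Proteobacteria holds the remaining Proteobacteria), the function name, and the lineage format are assumptions made for illustration only.

```python
from typing import Sequence

# Set names are from the text; the taxon-to-set rules are assumptions.
# Order matters: the more specific Gammaproteobacteria rule must be
# checked before the general Proteobacteria rule.
GYRB_RULES = [
    ("Archaea", "gyrB_Arch"),
    ("Actinobacteria", "gyrB_Actinobacteria"),
    ("Firmicutes", "gyrB_Firmicutes"),
    ("Gammaproteobacteria", "gyrB_G_proteobacteria"),
    ("Proteobacteria", "gyrB_Proteobacteria"),
]


def assign_gyrb_set(lineage: Sequence[str]) -> str:
    """Pick a gyrB set for a lineage given as a list of taxon names."""
    taxa = set(lineage)
    for taxon, set_name in GYRB_RULES:
        if taxon in taxa:
            return set_name
    # Anything not matched above falls into the catch-all set.
    return "gyrB_Bact_other"


if __name__ == "__main__":
    print(assign_gyrb_set(["Bacteria", "Proteobacteria", "Gammaproteobacteria"]))
```
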

Energy generation
Photosynthetic. Prokaryotes that utilize light either for carbon fixation or other metabolic processes form an important part of the microbial world especially in aquatic environments.
Prokaryotic pigments can have a wide range of functions, including photosynthesis, photoactive protein pumps, and pathogenesis (545,561,562). Genes for a number of different photoactive systems, with emphasis on photosynthesis, were submitted. These will help our understanding of prokaryotic metabolism in surface environments, especially for organisms that fix carbon dioxide either as their main source of carbon or as a backup source when fixed organic carbon becomes scarce. In pathogenic organisms these pigments may be involved in virulence mechanisms (562). Genes for the biosynthesis of pigments such as bacteriochlorophyll (16 genes; e.g., magnesium protoporphyrin IX methyltransferase), chlorophyll (9 genes; e.g., magnesium-protoporphyrin IX chelatase), bilins (4 genes; e.g., phycocyanobilin:ferredoxin oxidoreductase), carotenoids (22 genes; e.g., lycopene beta cyclase), and rhodopsins (1 gene; bacteriorhodopsin) were selected because of their association with or involvement in photosynthesis and thus their impact on primary production (563)(564)(565)(566). Carotenoids are also economically important as antioxidants and have beneficial health effects for humans and other animals.
Bacteriochlorophyll is involved in photosynthesis (567). Carotenoids can be involved in both photosynthesis and as photoprotectants (568). Phycobilins are involved in photosynthesis for a few groups of organisms (565,569,570). Proteorhodopsin is a light-driven proton pump and is theorized to have a range of physiological functions (571,572).
Electron transfer. Microorganisms generate energy by "coupling the flow of electrons in membranes to the creation of an electron motive force" (573). The electrons travel from low to high potential via electron carriers. Prokaryotes use a variety of electron transfer pathways.
Genes representing several cytochromes and hydrogenases were selected. Cytochromes are heme-containing proteins used to shuttle electrons (574). Hydrogenases catalyze the reversible oxidation of hydrogen, providing reducing ability or acting as an electron sink (575).