Viromic Analysis of Wastewater Input to a River Catchment Reveals a Diverse Assemblage of RNA Viruses

Enteric viruses cause gastrointestinal illness and are commonly transmitted through the fecal-oral route. When wastewater is released into river systems, these viruses can contaminate the environment. Our results show that we can use viromics to find the range of potentially pathogenic viruses that are present in the environment and identify prevalent genotypes. The ultimate goal is to trace the fate of these pathogenic viruses from origin to the point where they are a threat to human health, informing reference-based detection methods and water quality management.

sapovirus (SaV), astrovirus, and adenovirus, indicating that people are shedding viruses that are not necessarily detected in a clinical setting (8). This same study found a spike in norovirus genogroup GII sequence signatures in sewage 2 to 3 weeks before the outbreak of associated disease was reported in hospitals and nursing homes. The suggestion, therefore, is that environmental viromics can provide an early warning of disease outbreaks, in addition to the monitoring of virus dissemination in watercourses.
Recent reviews have proposed the use of viral metagenomics or viromic approaches as an alternative method to test for the presence of pathogenic viruses in the environment, offering the potential to detect novel genotypes or even entirely novel viruses (2,9,10). Potential new viral markers for fecal contamination have already been revealed, such as pepper mild mottle virus and crAssphage (11,12), among the huge diversity of human viruses found in sludge samples (13)(14)(15)(16).
In this pilot study, we have used viromics to investigate the presence of humanpathogenic RNA viruses in wastewater and estuarine surface water and sediment in a single catchment. The water and sediment samples were collected at the wastewater treatment plant (Llanrwst, Wales, United Kingdom) and downstream from it at the estuary of the river Conwy near a bathing water beach (Morfa, Wales, United Kingdom) (Fig. 1). To our knowledge, this is the first study to use unamplified environmental viral RNA for sequencing library construction, sequence data set production, and subsequent analysis. Because we used a directional library sequencing protocol on RNA, rather than amplifying to cDNA, we were able to distinguish single-stranded from double-stranded RNA genome fragments.  only a minor (Ͻ5%) proportion of the total assigned reads, with a few exceptions. In wastewater influent sample LI_11-10, reads assigned to the dsDNA family Papillomaviridae accounted for 61% of the total (Fig. 2, dark pink), and these reads were assembled into a single contig representing a nearly complete Betapapillomavirus genome. In the surface water sample, reads assigned to the ssDNA families Circoviridae and Microviridae represented 50% and 12% of the total, respectively (Fig. 2, yellow and orange), assembling into contigs representing a significant proportion of the genome. The presence of both ssDNA and dsDNA virus signatures in all data sets is most likely due to incomplete digestion of the viral DNA with the DNase Max kit. The families of dsRNA viruses present in these data sets were Totiviridae (fungal and protist hosts), Reoviridae (invertebrate, vertebrate, and plant hosts), Picobirnaviridae (mammals), Partitiviridae (fungi and protists), and Birnaviridae (vertebrates and invertebrates), with a small number of reads and contigs recognized as unclassified dsRNA viruses ( Fig. 2 and 3). None of these groups were present in all libraries, but totivirus FIG 2 Taxonomic distribution of curated read data (relative abundance) at the virus family level. Reads were assigned to a family or equivalent group by MEGAN6 using a lowest-common-ancestor algorithm, based on blastx-based homology using the program Diamond with the RefSeq Viral protein database (January 2017 version) and the nonredundant protein database (May 2017 version). Only viral groupings are shown. LI, sewage influent; LE, sewage effluent; SW, estuarine surface water; Sed, estuarine sediment. and picobirnavirus (Fig. 2, dark green) signatures were present in all wastewater samples and reoviruses (Fig. 2, bright green) were found in three of the four wastewater samples. Partitiviridae signatures were only found in wastewater samples LE_11-10 and LI_13-9, while Birnaviridae reads were only present in the wastewater LE_13-9 libraries. The sediment and surface water samples did not have detectable levels of dsRNA virus sequences.
Positive-sense ssRNA viruses were the most diverse class of viruses present in these data sets. The family Tombusviridae, which groups plant viruses with monopartite or bipartite linear genomes (19), was present in all samples with the sole exception of the wastewater influent sample LI_11-10 ( Fig. 2, cornflower blue, and Fig. 3). Virus signatures belonging to the family Virgaviridae, representing plant viruses, were present in all wastewater samples at high relative abundances (Fig. 2, dark red). Other highly represented families or groupings were the families Dicistroviridae (invertebrate hosts) and Nodaviridae (invertebrate and vertebrate hosts) and the bacteriophage family Leviviridae, the plant virus genus Sobemovirus (Fig. 2, medium purple), and the groupings of "unclassified ssRNA positive-strand viruses" and several unclassified/unassigned/ environmental members of the order Picornavirales. Sediment sample Sed1 was the only sample with signatures of the family Alvernaviridae, which has as its sole member the dinoflagellate virus Heterocapsa circularisquama RNA virus 01. The wastewater effluent sample LE_11-10 and influent sample LI_13-9 were the only samples with Heatmap colors denote relative abundances per sample. Contigs larger than 300 nucleotides (nt) were assigned to a family or grouping by MEGAN6 using a lowest-common-ancestor algorithm, based on blastx-based homology using the program Diamond with the RefSeq viral protein database (version January 2017) and the nonredundant protein database (May 2017 version). Only those families/groups comprising large contigs (Ͼ1,000 nt) or with contigs mapping to viral signature genes (e.g., capsid and RNA-dependent RNA polymerase genes) were retained. LI, sewage influent; LE, sewage effluent; SW, estuarine surface water; Sed, estuarine sediment.

Environmental Viromics for the Detection of Pathogens
Calicivirus signatures (Fig. 2, medium blue), and sample LE_11-10_1 and LE_1-10_2 were the only samples with Astroviridae reads (vertebrate host). Several families of the order Picornavirales were detected in the wastewater samples at different levels in different samples, and a small number of unassigned picornaviruses were detected in the surface water sample (SW).
We did not observe any known negative-sense ssRNA [(Ϫ)ssRNA] viruses in any of the sequencing libraries, but it is possible that some of the unaffiliated viral contigs belong to this class. The known human-pathogenic (Ϫ)ssRNA viruses are enveloped (19) and predicted to degrade more rapidly than the nonenveloped enteric viruses, especially in wastewater (20,21). We cannot rule out the possibility that (Ϫ)ssRNA viruses were present but were removed by our sampling protocol.
The general wastewater viral diversity found here is similar to that reported previously. Those studies that investigated RNA viruses found both bacterial and eukaryotic viruses, with a high abundance of plant viruses of the family Virgaviridae, which includes the Tobamovirus Pepper mild mottle virus (11,14). The families of viruses with potential human hosts found in previous metagenomics studies of sewage include Astroviridae, Caliciviridae, Picobirnaviridae, and Picornaviridae (13)(14)(15)(16), of which only picobirnaviruses were recovered in all wastewater viromes in this study. In contrast, members of the family Reoviridae, represented by the genus Rotavirus, were found in three of our four wastewater samples but were not detected in many of the previous studies (14)(15)(16).
Potential human-pathogenic viruses. An important aim of this study was to investigate the presence and genomic diversity of potential human-pathogenic RNA viruses in different sample types within the river catchment area. To minimize misassignments of short sequences to taxa, we used the assembled, curated contig data set and looked for contigs representing nearly complete viral genomes.
Presence of a norovirus GI.2 genome. We were particularly interested in finding norovirus genomes in order to explore the genomic diversity of these important and potentially abundant pathogens originating from sewage and disseminated in watercourses, with implications for shellfisheries and recreational waters. This is of relevance due to known issues of sewage contamination in the region (22). Members of the genus Norovirus (family Caliciviridae) are nonenveloped, icosahedral (ϩ)ssRNA viruses with a linear, unsegmented,~7.6-kb genome encoding three open reading frames (ORFs) (19). These viruses are divided into different genogroups, of which GI and GII are associated with human gastroenteritis (23,24). Noroviruses are identified routinely by qRT-PCR, providing an opportunity here to examine correlations between qRT-PCR and metaviromic data.
We only found norovirus signatures in the libraries of wastewater effluent sample LE_11-10. These reads assembled into a single contig of 7,542 bases, representing a nearly complete norovirus genome (GenBank accession number MG599789). Read mapping showed uneven coverage over the genome length between 18ϫ and 745ϫ (13,165 reads of library 1 and 8,986 reads of library 2). Based on this mapping, we performed variant calling and corrected the consensus sequence in cases where the variant was present in more than 85% of the reads.
A BLASTN search revealed two close relatives to our wastewater-associated norovirus genome, norovirus Hu/GI.2/Jingzhou/2013401/CHN (KF306212), which is 7,740 bases in length (25), displaying a nucleotide sequence identity of 99% over 99% of the genome length, and norovirus Hu/GI.2/Leuven/2003/BEL (FJ515294), at 95% sequence identity over 99% of the alignment length (Fig. 4). From the 5= end of our norovirus contig, 62 bases were missing compared with the sequence of Hu/GI.2/Jingzhou/ 2013401/CHN, and from the 3= end, 165 bases and the poly(A) tail were not present. We compared the sequence of our norovirus with that of Hu/GI.2/Jingzhou/2013401/CHN base by base and observed 81 single-nucleotide polymorphisms (SNPs) and no other forms of variation. Of the SNPs, only eight were nonsynonymous, resulting in five different amino acids incorporated in the nonstructural polyprotein (ORF1), one in the major capsid protein (MCP) (ORF2), and two in the minor structural protein (ORF3). According to the current classification criteria, this level of similarity places our assembled genome in genogroup GI, genotype GI.2, with only a single amino acid differing between the major capsid protein of Hu/GI.2/Jingzhou/2013401/CHN and the genome assembled here.
We tested the genotype grouping of our genome in a whole-genome phylogeny with all complete genome sequences of genogroup I available in GenBank. The phylogenomic tree clearly delineated the different genotypes within genogroup GI, placing the newly assembled genome within genotype GI.2, with the reference isolate for GII used as an outgroup (Fig. 5).
For further validation, the full genome of the novel norovirus GI was recovered using RT-PCR. However, the amplicon could not be ligated into a plasmid and, hence, was not fully sequenced.
Presence of diverse rotavirus segments in wastewater samples. Rotaviruses are segmented dsRNA viruses belonging to the family Reoviridae that cause gastroenteric illness in vertebrates and are transmitted through the fecal-oral route (19). Read signatures assigned to the genus Rotavirus were found in three of the four wastewater samples (all but LI_11-10). Wastewater influent sample LI_13-9 contained the most signatures, with approximately 75,000 reads, assembled into 120 contigs, representing genome fragments of 10 of the 11 rotavirus segments. At the species level, these genome fragments were assigned to either species Rotavirus A or Rotavirus C. Comparing the amino acid sequences of the predicted proteins, some contigs showed high levels of identity (Ͼ88%) with the segments of either rotavirus A (RVA) or rotavirus C (RVC) reference genomes as available in the RefSeq database (26,27), while others showed lower identities with a variety of RVC isolates only. The segmented-genome nature and the possibility of segment exchange make it difficult to confidently identify the number of rotavirus types present in this sample. Given the amino acid similarities with both RVA and RVC types, we suggest there are at least two and possibly three types present here.
Using the RotaC 2.0 typing tool for RVA and Blast-based similarity to known genotypes, we have typed the rotavirus genome segments found here ( Table 2). The combined genomic makeup of the RV community in sample LI_13-9 was G8/G10/Gx- (28,29). Environmental Viromics for the Detection of Pathogens Partial genomes of other potentially pathogenic RNA viruses. In sample LI_13-9, a small contig of 347 bases was found that was 94% identical at the nucleotide level to ORF1 of sapovirus Mc2 (AY237419) in the family Caliciviridae. We also identified four contigs of approximately 500 bases in sample LE_11-10 that resembled most closely the astrovirus MLB2 isolates MLB2/human/Geneva/2014 (KT224358) and MLB2-LIHT (KX022687), at 99% nucleotide identity. In addition, we identified several reads and contigs assigned to the family Picornaviridae, which comprises a diverse set of enteric viruses, but the closest relatives in the databases were metagenomically assembled or unidentified picornaviruses.
Picobirnaviruses showed a high prevalence in wastewater. All the wastewater virome libraries contained signatures assigned to the dsRNA family Picobirnaviridae, genus Picobirnavirus (Fig. 2), and these reads assembled into between 42 (LE_13-9) and 510 (LI_13-9) contigs. Both picobirnavirus genome segments, segment 1 encoding two hypothetical proteins and segment 2 on which the RNA-dependent RNA polymerase (RdRP) is encoded, were observed in the samples. The contigs showed little sequence similarity with the Human picobirnavirus reference genome (RefSeq GenBank accession numbers NC_007026.1 and NC_007027.1). Phylogenetic analysis of a partial region of the predicted RdRPs in the virome contigs was not able to resolve any cluster or evolutionary origin (Fig. 6A). Picobirnavirus RdRPs from human, animal, and environmental isolates, as well as the majority of the virome sequences, were grouped in one large, unsupported cluster that showed relatively little genomic diversity. While many picobirnaviruses have been isolated from humans with gastroenteritis, a review of the known cases suggested that picobirnaviruses are probably not the main cause of acute diarrhea and are secondary pathogens with potentially synergistic effects (30). A qRT-PCR-based investigation into the suitability of human picobirnaviruses as indicators of human fecal contamination showed that they were not present in a sufficient proportion of tested samples to be good water quality indicators (31), but their high relative abundance in our sample set warrants further investigation for their use as water quality markers using metaviromic methods.
A recent study of picobirnaviruses gave rise to the hypothesis that these viruses do not infect mammals but are a new family of RNA bacteriophages, based on the presence of bacterial ribosome binding sites (RBS) upstream from the coding sequences (CDSs) (32). To test this hypothesis, we extracted all contigs with amino acid similarity to RdRPs or capsid proteins of known picobirnaviruses, annotated the CDSs, and extracted the 21 nucleotides upstream from the transcription start site. In the 233 contigs found, 71 partial CDSs were predicted, from which we extracted 17 5= UTRs (untranslated regions), discarding those partially annotated CDSs missing the transcription start site. We discovered the 6-mer motif AGGAGG (Fig. 6B) in 100% of the upstream sequences, similar to the frequency reported by Krishnamurthy and Wang (32), who found at least a 4-mer RBS in 100% of the 98 picobirnavirus 5= UTRs investigated. In contrast, the different families of eukaryotic viruses analyzed in that study only showed a low incidence of RBSs, which were mostly 4-mers. Our findings, therefore, support the hypothesis that picobirnaviruses are bacteriophages, and we suggest that they belong to a novel RNA bacteriophage family with a high level of genomic diversity.

DISCUSSION
We set out to explore the possibility of using viromics to find human-pathogenic RNA viruses in the environment. We have been successful in identifying several potentially human-pathogenic, including potentially zoonotic, viral genomes from the wastewater samples but did not find any in the surface estuarine water and sediment It is important to note here that during the RNA extraction process, many biases could have been introduced, leading to a lower recovery of input viruses. Samples were first concentrated from volumes of 1 liter (wastewater) or 50 liters (surface water) down to 50 ml, using tangential flow filtration (TFF) at a molecular-mass cutoff of 100 kDa, followed by polyethylene glycol (PEG) 6000 precipitation. These samples were diluted in fresh buffer, filtered through syringe filters of 0.22-m pore size, and then treated with nuclease to remove free DNA and RNA. Previous research has shown that, while any enrichment method aimed at fractionating the viral and cellular components will decrease the total quantity of viruses, a combination of centrifugation, filtration, and nuclease treatment increases the proportion of viral reads in sequencing data sets (33). After implementing these steps, we performed viral RNA extraction using the Mo Bio PowerViral environmental DNA/RNA extraction kit, which has previously been shown to perform best overall in spiking experiments with murine norovirus, in terms of extraction efficiency and removal of inhibitors (34). The kit has, however, given low recoveries of viruses from sediment (35).
We did not perform an amplification step before library construction with the NEBNext Ultra directional RNA library preparation kit for Illumina, to retain the genome sense and strand information. Instead, we increased the number of cycles of random PCR during library preparation from 12 to 15 to counteract the low input quantity of RNA (Ͻ1 ng). The random amplification during library construction led to a trade-off in which genome strand information was gained for a loss of quantitative power, making it difficult to compare abundances of viral types within and across libraries. This random-PCR-based bias has been highlighted before, but the proposed solution of using library preparation protocols which limit the use of PCR is only feasible with large amounts of input nucleic acid (36), which we have not found to be possible when processing environmental/wastewater samples to generate RNA metaviromes.
A critical issue to highlight here is the inclusion of controls in our sequencing libraries in order to identify potential contaminants and their origins, as has been suggested previously (37,38). There have been multiple reports of false-positive genome discoveries, in particular the novel parvovirus-like hybrid in hepatitis patients that was later revealed to originate from the silica-based nucleic acid extraction columns (39)(40)(41). In this study, we included a positive control that comprised bacterial cells (Salmonella enterica serovar Typhimurium isolate D23580; GenBank RefSeq accession number NC_016854) and mengovirus (36), an RNA virus that serves as a process control, and two negative controls, an extraction control and a library preparation control. Analysis of the control libraries showed that while the Salmonella cells and DNA were successfully removed from the positive-control sample by the enrichment protocol, the mengovirus was not recovered. Subsequent qRT-PCR analysis revealed that the mengovirus remained detectable in the preprocessing stages of the extraction but was lost after RNase treatment (data not shown). The inclusion of an inactivation step of the DNase at 75°C potentially exacerbated the effect of the RNase step. Consequently, it is likely that we have missed viral types during the extraction process despite having still managed to recover an RNA metavirome harboring substantial diversity.
Further examination of the HiSeq and MiSeq control data sets revealed a wide range of contaminant signatures of prokaryotic, eukaryotic, and viral origin, making up 45 million read pairs per control on the HiSeq platform and 1 million read pairs for the MiSeq, even though the 16S and 18S rRNA PCR and RT-PCR reactions produced no visible bands on an agarose gel. Most bacterial contaminant reads belonged to the phyla Proteobacteria, Actinobacteria, and Firmicutes. The most abundant genera included Corynebacterium, Propionibacterium, Sphingomonas, Ralstonia, Pseudomonas, Streptomyces, Staphylococcus, and Streptococcus, members of which have been identified as common laboratory contaminants in the past (42). Within the eukaryotic signatures, human-derived, Beta vulgaris, and Anopheles reads were the most prevalent, pointing toward potential cross-contamination of the sequencing libraries. A small number of virus signatures were also identified, with the most prominent being feline calicivirus and dengue virus. The presence of the calicivirus was traced back to the library preparation kits after the libraries were reconstructed and resequenced. The dengue virus signature was a Ͻ100-nt sequence which was coextracted in all the samples and potentially originated in one of the reagents or the spin extraction column. All sequences present in the controls were carefully removed from the sample data sets during the quality control stage of the bioinformatics processing before further analysis. For future experiments, we will omit the RNase treatment step during extraction and filter out any contaminating rRNA or cellular-derived mRNA sequences as part of the bioinformatic quality control workflow.
Our results show that while contamination is an issue when dealing with lowbiomass samples, the combination of increased random PCR cycles during library preparation, deep sequencing (i.e., HiSeq rather than MiSeq), and computational Environmental Viromics for the Detection of Pathogens subtraction of control sequences provides data of sufficient quantity and quality to assemble nearly complete RNA virus genomes de novo.
Norovirus. Noroviruses are one of the most common causes of gastrointestinal disease in the developed world, with an incidence in the United Kingdom estimated as approaching 4 million cases per annum (43). The genotype most commonly associated with disease is GII.4 (44)(45)(46), which was not detected in the metaviromes generated here.
We retrieved one norovirus GI genome, assembled from 22,151 reads, in wastewater effluent sample LE_11-10. This finding was in direct conflict with the qRT-PCR analysis of this sample, which did not detect any NoV GI signatures (Table 1). In contrast, NoV GII signatures were detected by qRT-PCR, but no NoV GII genomes or genome fragments were observed in the virome libraries. One hypothesis to explain the discrepancy between PCR and viromics approaches lies in the differences in extraction protocol. For qRT-PCR, no viral enrichment step was performed and RNA was not extracted with the PowerViral kit. Therefore, NoV GII could have been lost before virome sequencing, as was the process control mengovirus. An alternative hypothesis is that the NoV GII signatures detected during qRT-PCR were derived from fragmented RNA or from particles with a compromised capsid. In both these cases, the RNA would not be detected in the virome data because of the RNase preprocessing steps implemented in the enrichment/extraction protocol. This calls into question the reliance of qRT-PCR for NoV detection and whether the detected viruses are infectious or merely remnants of previous infections. Further research using, for example, capsid integrity assays combined with infectious particle counts will need to be conducted to assess the validity of qRT-PCR protocols for norovirus detection.
The inability to identify NoV GI with qRT-PCR might be related to the mismatched base present in the forward primer sequence used for detection, but even without mismatches, primer-probe pairs can be improved to provide better detection. In a recent study, researchers designed an improved probe for NoV GI.2 strains, lowering the limit of detection for these strains from waterborne samples (47). It is, therefore, possible that the NoV GI.2 detected here with viromics methods was present below the limit of detection of the ISO standard primer/probe combination (48) used in our study. Viromics as a means of investigating water samples for the presence of norovirus does have the advantage of demonstrating the presence of an undegraded genome, provided the sample processing requirements do not lead to excessive loss of virus particles, resulting in false negatives. Certainly, time and cost permitting, viromics is a useful adjunct to qPCR for samples for which it is deemed particularly important or critical that the presence of intact viral genome be determined.
While there have been recent breakthroughs in growing human NoV (49,50), its culture remains very difficult and not yet suitable for routine testing for NoV in the environment. Hence, studies using male-specific coliphages, such as MS2 and GA, which are ssRNA phages belonging to the family Leviviridae, are still worthwhile as alternative model systems (51,52). Interestingly, while some levivirus signatures were present in all wastewater samples (Ͻ500 reads), we observed a striking cooccurrence of these viruses with norovirus signatures in both libraries of sample LE_11-10 (Ͼ2,500 reads). The most commonly observed viruses in this sample were Pseudomonas phage PRR1, an unclassified levivirus, and Escherichia phages FI and M11 in the genus Allolevivirus. Further studies with more samples and replicates will indicate whether there is a significant correlation between the presence of leviviruses and noroviruses in water samples. Furthermore, the higher abundance of alloleviviruses than of MS2-like viruses could indicate that the former might be more relevant as model systems for noroviruses.
Rotavirus. Rotaviruses are, like noroviruses, agents of gastroenteritis, but the disease is commonly associated with children under the age of 5, where severe diarrhea and vomiting can lead to over 10,000 hospitalizations per year in England and Wales (53). Since the introduction of the live attenuated vaccine Rotarix, the incidence of gastroenteritis in England has declined, specifically for children aged Ͻ2 and during peak rotavirus seasons (54)(55)(56). Therefore, the discovery of a diverse assemblage of rotavirus genome segments in the wastewater samples here was less expected than the norovirus discovery. While we were unable to recover the genome of the vaccine strain, our genomic evidence suggests that at least one RVA and one RVC population were circulating in the Llanrwst region in September 2016.
The genome constellation for the RVA segments in sample LI_13-9, G8/G10-P[1]/ P [14]/P[41]-I2-R2-C2 -M2-A3/A11-(N)-T6-E2-(H) (N and H segments were not recovered in this study), is distinctly bovine in origin (28). The closest genome segment relatives based on nucleic acid similarity, however, have been isolated from humans (Table 2), possibly pointing toward a bovine-human zoonotic transmission of this virus (57). The same genomic constellation has been found recently when unusual G8P [14] RVA isolates were recovered from human strain collections in Hungary (58) and Guatemala (59) and isolated from children in Slovenia (60) and Italy (61). Cook and colleagues calculated that there would be approximately 5,000 zoonotic human infections per year in the United Kingdom from livestock transmission, but many would be asymptomatic (62). The presence of RVA in the wastewater of Llanrwst could be from zoonotically infected individuals shedding into the wastewater, but it is equally likely that RVA from cattle farms in the area spilled over into the sewage system.
The origins of the RVC genome segments are more difficult to trace, because of lower similarity scores with known RVC isolates. The majority of the segments were similar to porcine RVC genomes, while others showed no nucleotide similarity at all, only a low degree of amino acid similarity. An explanation for the presence of pig-derived rotavirus signatures could be farm runoff. While farm waste is not supposed to end up in the sewage treatment plant, it is likely that the RVC segments originate directly from pigs, not through zoonotic transfer. Runoff from fields onto public roads, broken farm sewer pipes, or polluted small streams might lead to porcine viruses entering the human sewerage network, but we cannot provide formal proof from the data available. Based on the evidence, we hypothesize that there are one or possibly two divergent strains of RVC circulating in the pig farms in the Llanrwst area.
Conclusion. In this study, we investigated the use of metagenomics for the discovery of RNA viruses circulating in watercourses. We have found RNA viruses in all samples tested, but potential human-pathogenic viruses were only identified in wastewater. The recovery of plant viruses in most samples points toward potential applications in crop protection, for example, the use of metaviromics in phytopathogen diagnostics. However, technical limitations, including the amount of input material necessary and contamination of essential laboratory consumables and reagents, are currently the main bottleneck for the adoption of fine-scale metagenomics in routine monitoring and diagnostics. The discovery of a norovirus GI and a diverse set of rotavirus segments in the corresponding metaviromes indicates that qPCR-based approaches can miss a significant portion of relevant pathogenic RNA viruses present in water samples. Therefore, metagenomics can, at this time, best be used for exploration, to design new diagnostic markers/primers targeting novel genotypes, and to inform diagnostic surveys on the inclusion of specific additional target viruses.

MATERIALS AND METHODS
Sample collection and processing. Wastewater samples were collected as part of a viral surveillance study described elsewhere (63). Wastewater influent and effluent samples, 1 liter each, were collected at the Llanrwst wastewater treatment plant by Welsh Water (Wales, United Kingdom) (Fig. 1)  The wastewater and surface water samples were processed using a two-step concentration method as described elsewhere (63). In brief, the 1-liter (wastewater) and 50-liter (surface water) samples were first concentrated down to 50 ml using a KrosFlo research IIi tangential flow filtration (TFF) system (Spectrum Labs, United States) with a 100-PES (polyethersulfone) membrane. Particulate matter was then eluted from solid matter in the concentrates, using beef extract buffer, and then viruses were precipitated using polyethylene glycol (PEG) 6000. The viruses from the sediment samples were eluted and concentrated using beef extract elution and PEG precipitation as described elsewhere (35). The precipitates were eluted in 2-to 10-ml phosphate saline buffer (PBS, pH 7.4) and stored at Ϫ80°C.
Detection and quantification of enteric viruses with qRT-PCR. Total nucleic acids were extracted from a 0.5-ml aliquot of the concentrates using the NucliSENS miniMag nucleic acid purification system (BioMérieux, France). The final volumes of the nucleic acid solution were 0.05 ml (surface water and sediment) and 0.1 ml (wastewater samples). Norovirus GI (64,65) and GII (66,67), sapovirus GI (68), and hepatitis A and E viruses (69,70) were targeted in qRT-PCR assays as described elsewhere (71). Viral RNA extraction for metaviromic sequencing. Viral particles were extracted from the concentrated samples by filtration. In a first step, the samples were diluted in 10 ml of sterile 0.5 M NaCl buffer and incubated at room temperature (20°C) with gentle shaking for 30 min to disaggregate particles. The suspension was then filtered through a sterile, 0.22-m pore-size syringe filter (PES membrane; Millex). The sample was desalted by centrifugation (3,200 ϫ g, between 1 and 6 h for different samples) in a sterilized spin filter (Vivaspin 20, 100-kDa molecular-mass cutoff) and replacement of the buffer solution with 5 ml of a Tris-based buffer (10 mM Tris-HCl, 10 mM MgSO 4 , 150 mM NaCl, pH 7.5). The buffer exchange was performed twice, and the volume retained after the final spin was Ͻ500 l. The samples were then treated with Turbo DNase (20 units; Ambion) and incubated for 30 min at 37°C, followed by inactivation at 75°C for 10 min. In a next step, all samples were treated with 80 g RNase A (Thermo Fisher Scientific) and incubated at 37°C for 30 min. The RNase was inactivated with RiboLock RNase inhibitor (Thermo Fisher Scientific), the inactivated complex was removed by spin filtration (Vivaspin 500, 100-kDa molecular-mass cutoff), and the samples centrifuged until the volume was approximately 200 l. Viral DNA and RNA were coextracted using the PowerViral environmental DNA/RNA kit (Mo Bio Laboratories) according to the manufacturer's instructions. In this protocol, buffer PV1 was supplemented with 20 l/ml betamercaptoethanol to further reduce RNase activity. The nucleic acid was eluted in 100 l RNase-free water. The extracted viral DNA was degraded using the DNase Max kit (Mo Bio Laboratories) according to the manufacturer's instructions. The remaining viral RNA was further purified and concentrated by ethanol precipitation using 2.5ϫ the sample volume of 100% ethanol and 1/10 volume of diethyl pyrocarbonate (DEPC)-treated Na-acetate (3 M). The quantity and quality of RNA were determined with Bioanalyzer Pico RNA 6000 capillary electrophoresis (Agilent Technologies). Positive and negative extraction control samples were processed alongside the main samples. The positive-control samples contained Salmonella enterica serovar Typhimurium strain D23580, which is not found in the United Kingdom (72), and mengovirus as a process control virus (71,73).
The viral RNA extracts were tested for bacterial and eukaryotic cellular contamination using 16S and 18S rRNA gene PCR and RT-PCR, with primers e9F (74) and 519R (75) for the 16S rRNA gene and primers 1389F and 1510R (76) for the 18S rRNA gene. Complementary DNA was created using SuperScript III reverse transcriptase (Invitrogen) with random hexamer primers according to the manufacturer's instructions. RT-PCR was performed with MyTaq red mix (Bioline) for 35 cycles (95°C for 45 s, 50°C for 30 s, and 72°C 1 min 40 s) and visualized on a 1% agarose gel. Samples were considered suitable for sequencing if no DNA bands were visible on the gel.
Library preparation and sequencing. The library preparation and sequencing were performed at the University of Liverpool Centre for Genomics Research (CGR). Twelve dual-indexed, strand-specific libraries were created using the NEBNext Ultra directional RNA library preparation kit for Illumina, according to the manufacturer's instructions. These libraries were pooled and sequenced at 2 ϫ 150-bp read lengths on the Illumina HiSeq 4000 platform. This generated between 10 and 110 million paired reads per sample.
To confirm our results, a second set of libraries was constructed from new kits and a Milli-Q water sample was included as a library preparation control. The 13 resulting libraries were sequenced at 2 ϫ 150-bp read lengths on the Illumina MiSeq platform at the CGR, University of Liverpool. These data were used for verification and control purposes only, as the sequencing depth was insufficient for the bioinformatics analyses described in the rest of the study.
Bioinformatics. All command line programs for data analysis were run on the bioinformatics cluster of CGR (University of Liverpool) in a Debian 5 or 7 environment.
Raw fastq files were trimmed to remove Illumina adapters using Cutadapt version 1.2.1 with option -O 3 (77) and Sickle version 1.200 with a minimum quality score of 20 (78). Further quality control was performed with Prinseq-lite (79) with the following parameters: minimum read length of 35, GC percentage between 5 and 95%, minimum mean quality of 25, dereplication (removal of identical reads, leaving 1 copy), and removal of tails of a minimum of 5 poly(N) sequences from 3= and 5= ends of reads.
The positive-and negative-control libraries described earlier were used for contaminant removal. The reads of the control samples were analyzed using Diamond blastx (17) against the nonredundant protein database of NCBI (nr, November 2015 version). The blast results were visualized using MEGAN6 Community Edition (18). An extra contaminant file was created with the complete genomes of species present at over 1,000 reads in the positive-and negative-control samples. Then, bowtie2 (80) was used for each sample to subtract the reads that mapped to the positive-control, negative-control, or contaminant file. The unmapped reads were used for assembly with SPAdes version 3.9.0, with k-mer values of 21, 31, 41, 51, 61, and 71 and the options --careful and a minimum coverage of 5 reads per contig (81).
The contig files of each sample were compared with the contigs of the controls (assembled using the same parameters) using blastn of the BLASTϩ suite (82). Contigs that showed significant similarity with control contigs were manually removed, creating a curated contig data set. The unmapped read data sets were then mapped against this curated contig data set with bowtie2, and only the reads that mapped were retained, resulting in a curated read data set.
The curated contig and read data sets were compared to the RefSeq viral (January 2017 release) and nonredundant protein (nr, May 2017 release) reference databases using Diamond blastx at an e value of 1eϪ5 for significant hits (17,83,84). Taxon assignments were made with MEGAN6 Community Edition according to the lowest-common-ancestor algorithm with default settings (18). We chose the family level taxon assignments to represent the overall viral diversity because there is generally little amino acid identity between viral families. The taxon abundance data were extracted from MEGAN6 and imported into RStudio for visualization (85). Genes on the assembled contigs were predicted with Prokka (86) using the settings --kingdom Viruses and an e value of 1eϪ5. Multiple alignments of genes and genomes were made in MEGA7 using the MUSCLE algorithm with default settings (87,88). The alignments were manually trimmed, and phylogenetic trees were built using the maximum-likelihood method in MEGA7 with the default settings. Sequences upstream from potential CDSs of Prokka-annotated picobirnaviruses were extracted using extractUpStreamDNA (https://github.com/ajvilleg/extractUpStreamDNA), and all 5= UTRs and transcription start sites were manually verified in UGene (89). These extracted sequences were then subjected to a motif search using the MEME Suite (90,91).
Accession numbers. Read and contig data sets are available from NCBI under the following BioProject accession numbers: PRJNA421889 (wastewater data), PRJNA421892 (sediment data), and PRJNA421894 (estuarine water data). The NoV GI genome isolate was deposited in GenBank under accession number MG599789.