Epstein-Barr Virus-Positive Cancers Show Altered B-Cell Clonality

Around 20% of human cancers are associated with viruses. Epstein-Barr virus (EBV) contributes to gastric cancer, nasopharyngeal carcinoma, and certain lymphomas, but its role in other cancer types remains controversial. We assessed the prevalence of EBV in RNA-seq from 32 tumor types in the Cancer Genome Atlas Project (TCGA) and found EBV to be present in >5% of samples in 12 tumor types. EBV infects epithelial cells and B cells and in B cells causes proliferation. We hypothesized that the low expression of EBV in most of the tumor types was due to infiltration of B cells into the tumor. The increase in B-cell abundance and diversity in subjects where EBV was detected in the tumors strengthens this hypothesis. Overall, we found that EBV was associated with an increased and altered immune response. This result is not evidence of causality, but a potential novel biomarker for tumor immune status.

stringent criteria, such as requiring at minimum 1,000 bp of combined genome coverage or Ͼ1,000 individual reads/sample. We hypothesized that the tumor represents a specialized environment within the body, which may selectively attract, retain, and/or foster localized replication of viruses and of virus-infected cells, even if these viruses did not have a causal role in tumor development. These would be present at low levels. Accordingly, it is important to view the results presented here not as evidence for causality, but as potentially novel biomarkers with clinical implications.
Epstein-Barr virus (EBV) is a human herpesvirus, and like the other seven human herpesviruses, EBV establishes lifelong persistence in the human host, a state that is termed latency. Over 90% of adults are infected with EBV by age 50, and at any given time, a third of seropositive persons shed infectious virus in saliva (8). Once infected, the EBV genome is continuously present in circulating CD38-positive memory B cells (9). During this state, no or only a very few viral mRNAs are transcribed. Spontaneous mononucleosis or immunodeficiency-associated reactivation events result in amplification of EBV-infected cells and extended mRNA transcription. De novo B-cell and epithelial cell infection are associated with widespread, promiscuous transcription across the entire ϳ172,000-bp viral genome. Unlike the other human herpesviruses, except for human herpesvirus 8 (Kaposi sarcoma-associated herpesvirus), EBV is strongly associated with cancer and able to transform primary human B cells in culture. There exists sufficient evidence to associate EBV infection with Burkitt lymphoma (BL), Hodgkin's disease (HD), posttransplant lymphoproliferative disease (PTLD), and rare forms of non-Hodgkin's lymphoma, such as those localized to the central nervous system in AIDS patients (AIDS-CNS lymphoma) (1). These cancers originate from B cells, and the virus is present at multiple copies in every tumor cell. The exception is HD, where only a fraction of the proliferative compartment, the Reed-Sternberg cell, is EBV positive. HD is one example of a cancer that is convincingly and causally virus associated, yet the virus is present in only a fraction of the cells in the lesion. In addition, EBV is associated with the epithelial cancers nasopharyngeal carcinoma (NPC) and gastric cancer. Here again, the EBV genome and EBV mRNAs are found in every single tumor cell. Because cancers are a heterogeneous group of diseases, the EBV association ranges from Ͼ90% in AIDS-CNS lymphoma to approximately 10% in gastric cancer (10). Furthermore, many associations depend on geographic locale. For instance, most cases of EBV-positive NPC are observed in southern China and among the Inuit in the Arctic (11).
In addition to these bona fide associations, which are verified by multiple, independent epidemiological studies, EBV DNA, protein, and RNAs have been found in the biopsy material of many other cancers. In the case of breast cancers, discordant results "for" and "against" the presence of EBV in a given tumor have been published (12,13). In these cases, it is unclear if (i) EBV contributes mechanistically to tumor development, (ii) is a "passenger" virus due to the presence of infected, infiltrating B cells, or (iii) represents isolated instances of local abortive infection. To more broadly explore the association between EBV and cancers of different lineages, we queried the entire TCGA mRNA-seq data set, representing 10,396 cancer cases, for the presence of EBV transcripts.

RESULTS
Epstein-Barr virus is present in many tumor types. To detect viruses from RNA-seq data, we used VirDetect, software based on the principles of digital subtraction (14). We deployed the algorithm on the Google Compute platform. We developed VirDetect for two reasons: (i) we could not find public work flows for detection of many viruses at once, and (ii) many of the published work flows require manual curation after alignment due to poor specificity. The poor specificity is caused by the areas of low complexity and human homology found in some viral genomes. The added value of VirDetect is the masked genome, which leads to very few false-positive alignments. To test VirDetect's ability to detect selected viruses, we simulated random reads from EBV and four other viruses from different genomic classes (see Fig. S1 in the supplemental material). We estimated the false-positive rate as 0% and the false-negative rate as 8% for EBV: i.e., 8% of reads drawn from EBV did not realign back to an EBV genome as masked by VirDetect. These represent repeat and/or low-complexity regions that are present in the virus.
We used VirDetect to query all mRNA-seq samples from the TCGA (n ϭ 10,396 samples, 32 tumor types) for evidence of transcripts originating from any of 1,894 vertebrate viruses. EBV was present in at least some samples among most tumor types, and present in Ͼ10 samples in 13 tumor types (Fig. 1A). We required a minimum of 10 samples per tumor type for exploratory association analyses; further analyses focused only on these tumor types. The EBV mRNA-positive (EBV ϩ ) tumor types included stomach cancer (STAD) and diffuse large B-cell lymphoma (DLBCL), as expected. The remaining tumor types were breast, bladder, cervical, colon, esophageal, head and neck, liver, lung (both adenocarcinoma and squamous cell carcinoma), rectal, and skin cutaneous melanoma. As a positive control, we used human endogenous retrovirus transcription (HERV-K group) mRNA, which was detected in every sample (15).
EBV was present in only a few DLBCL samples, as the TCGA data set is comprised largely of sporadic DLBCL and contains very few pediatric Burkitt, CNS, or HIVassociated DLBCLs. In one striking case, EBV-mapped read counts peaked at over 100,000 reads, which amounted to 0.5% of all mapped reads for this sample (Fig. 1B). EBV was detected in 45% of STAD samples. The expression of EBV in STAD was bimodal, with two populations (a) ranging from 2 to 200 and (b) 1,000 to 100,000 counts (   10 [raw EBV counts]) for tumor types with at least 10 subjects with EBV expression. Boxes represent median Ϯ interquartile range and whiskers median Ϯ 1.5ϫ interquartile range. The color of the box represents if the tumor type was determined by WHO to be associated with EBV (blue) or not (black). The color of the dot represents EBV transcript level (gray, Ͻ10; yellow, 10 to 100; orange, 100 to 1,000; red, Ͼ1,000. (C) Density distribution for all tumor types except gastric cancer (STAD [gray]), colon, rectal, and esophageal combined (black), and STAD (purple). Note the pronounced bimodal distribution of EBV read count for STAD. and C). STAD samples with the higher range of EBV expression were independently determined as belonging to the EBV molecular subtype (see Fig. S2 in the supplemental material) (16). For these, we expect that EBV was present in the epithelial lineage tumor cell itself. EBV expression in the lower range, which was predominant among all other positive cancer types, may be due to infiltrating B cells, as previously described by Ryan et al. (10) for gastric adenocarcinoma samples.
Esophageal carcinoma had the second highest percentage of samples with EBV mRNAs (40%), which may be due to some of the samples being in anatomical proximity to the stomach and occasionally indistinguishable from gastric cancer samples (17,18). The esophageal carcinoma samples collected from the junction between the esophagus and the stomach were significantly enriched for EBV ϩ samples compared to samples from other esophageal anatomical sites (chi-square test, P Յ 0.0004 [see Fig. S3 in the supplemental material]) and were among the "high" read class (red in Fig. 1B). EBV mRNA was also present in colon and rectal cancers at elevated frequencies, but with low read counts. These broad tumor types-stomach, esophageal, colon, and rectal-have a commonality in that each was from a digestive gastrointestinal organ.
A number of TCGA samples had adjacent nontumor material available from the same patient. Nine of 13 tumor types had Ͼ10 samples with matched adjacent tissue. Among this set of matched samples, EBV expression was enriched in the tumor overall (chi-square test, P Յ 6.6 ϫ 10 Ϫ8 ) and increased individually in most tumor types (see Fig. S4 in the supplemental material). Observing an enrichment in tumor tissue compared to nontumor tissue is consistent with the hypotheses that EBV (i) was present in the tumor cell itself or (ii) was present due to infiltrating B cells. Observing an enrichment is inconsistent with the hypothesis of random sample contamination. B cells should not be present at large amounts in nonlymphatic and noncancerous tissue, unless there is an infection, but B cells may be abundant in tumors due to increased angiogenesis or the immune response to the tumor.

Epstein-Barr virus is associated with B-cell signatures.
To formally test the hypothesis that EBV positivity was associated with a specific tumor resident or infiltrating cell type, we tested if any gene expression module (a group of correlated genes, summarized by a single value [19]) was differentially expressed between EBV ϩ and EBV Ϫ samples. Eighty out of 485 gene expression modules tested were significantly differentially expressed (Mann-Whitney U test, q Ͻ 0.01, after false-discovery rate [FDR] adjustment by the Benjamini-Hochberg procedure) in at least one tumor type and 43 in two or more tumor types ( Fig. 2A). Most of the modules were immune related, such as natural killer (NK) and B-and T-cell modules (all upregulated in EBV ϩ samples), consistent with the hypothesis that infiltrating immune cells are the carrier of the EBV signal. Other modules were tumor or inflammation related, such as P53 and STAT3. In 9/13 (69%) tumor types, B-cell-associated modules were significantly differentially expressed (maximum q value of Ͻ0.01 across these tumor types).
The B-cell gene expression module had significantly higher median expression in EBV ϩ samples compared to EBV Ϫ samples in 11/13 (84%) tumor types (linear regression conditioning on tumor type, P Ͻ 2 ϫ 10 Ϫ16 ) (Fig. 2B). As expected B-cell gene expression was not increased in EBV ϩ stomach cancer, as the virus resides in epithelial cells (Fig. 2B, orange). Also, as expected, DLBCL did not show an increase in B-cell gene expression, since by definition all DLBCLs are of a B-cell lineage. As a control, we evaluated the association between B-cell gene expression and human cytomegalovirus (HCMV/HHV5), a ubiquitous herpesvirus of comparable genome size. There was no significant association with the presence of this virus and B-cell gene expression, except in bladder cancer (unadjusted P value of Յ0.02 by Mann-Whitney U test, FDR-adjusted q value of Յ0.17) (Fig. 2B). The association between the B-cell gene expression and HCMV in bladder cancer is consistent with the biology of HCMV. Whereas there is not enough evidence to establish a direct connection between HCMV and bladder cancer, one would expect to find HCMV in bladder tissue, as this herpesvirus, unlike EBV, is transmitted to a large degree by urine and shed continuously throughout life (20).
Among the tumor types, melanoma and liver cancer had the greatest increase of the B-cell gene expression module in EBV ϩ samples compared to EBV Ϫ samples.
Memory B-cell phenotype is enriched in tumor samples with Epstein-Barr virus. EBV establishes latency in memory B cells, and at any given time, a percentage of EBV-positive CD20 ϩ IgD Ϫ CD38 ϩ memory B cells are circulating in the blood (9). In contrast, active EBV replication/reactivation induces a B-cell blast phenotype and results in the de novo infection of naive B and epithelial cells. To test the hypothesis that memory B cells were enriched in EBV ϩ tumors, we used machine learning to build a classifier from previously published transcription profiles of sorted human B cells (GSE56315) (21). The B-cell transcriptional profiles were divided into memory, centroblast, centrocyte, plasmablast, and naive B-cell types. We classified all the tumor samples according to which cell type had the greatest similarity by expression using distance-weighted discrimination (22,23). Using a generalized linear model and adjusting for tumor type, we found that the assigned memory phenotype probability was significantly higher (P Յ 4.1 ϫ 10 Ϫ7 ), while the naive, nonactivated, phenotype was significantly lower in the EBV ϩ samples (P Յ 1.1 ϫ 10 Ϫ12 ) ( Table 1). Assessing each of the 13 tumor types individually, we found that EBV-positive samples were enriched in the memory B-cell phenotype in 6/13 (46%) tumor types by chi-square test (P Յ 0.05) (Fig. 2C). This observation is consistent with the hypothesis that the tumor-associated EBV signal stems from tumor-infiltrating, perhaps activated, memory B cells rather than naive B cells.

Epstein-Barr virus is associated with increased B-cell diversity. EBV infection is known to induce cell proliferation and expansion of the infected B cells. Hence, one
would expect that a reactivation or de novo infection event would lead to a change in B-cell diversity. To assess B-cell diversity in EBV ϩ tumor samples, we assembled B-cell receptors (BCRs) using V'DJer (24) and calculated diversity using Shannon entropy. The discussed BCRs are from heavy-chain IgG, which is the chain associated with an activated B-cell response (naive B cells bear IgD or IgM isotypes). Using linear regression, conditioning on the 13 tumor types with more than 10 EBV samples, we found that samples with EBV had (i) significantly increased total BCR counts (P Յ 1.4 ϫ 10 Ϫ20 ) and (ii) higher BCR diversity (P Յ 8.3 ϫ 10 Ϫ27 ). We also tested each cancer type individually, using the Mann-Whitney U test. As expected, DLBCL, which is a monoclonal cancer of B cells, exhibited the lowest diversity (median ϭ 0.29) and had no difference between samples by EBV status (Fig. 3A). In contrast, stomach cancer exhibited the highest diversity and had a significant enrichment of diversity in EBV ϩ samples (P Ͻ 0.001) (Fig. 3A). Most tumor types had higher BCR abundance as well as BCR diversity in the EBV ϩ samples.
Shannon entropy is a composite measure of both species evenness and abundance. To determine if Shannon entropy was only higher due to its dependence on abundance, or whether there was truly a diversity difference in samples with EBV expression, we tested a model conditioning on all 13 cancer types together as well as counts and found that diversity was still significant after conditioning on counts (P Յ 3.61 ϫ 10 Ϫ14 by linear regression) (Fig. 3B). In particular, melanoma, lung adenocarcinoma, head and neck cancers, and bladder urothelial carcinoma were significantly associated with diversity after first conditioning on counts (P Յ 0.01, 0.004, 0.009, and 0.009, respectively).
We assessed if the presence of any other virus was associated with BCR abundance and diversity and found that only EBV was associated with these two parameters (see Fig. S5 in the supplemental material); however, as most of the other viruses were not as prevalent as EBV, we had less power to detect such an association. These data are consistent with enrichment of class-switched, polyclonal B cells in EBV ϩ tumor samples.

DISCUSSION
We affirmed an association of EBV with DLBCL and a subset of gastric cancer (18,25). In the case of esophageal carcinoma, the detection rate of EBV was associated with proximity to the stomach. As recently reported, it is difficult to distinguish some esophageal from gastric tumors (16,17). EBV is ubiquitous in the adult population (26). EBV genomes are detectable in 0.004 to 0.01% of mononuclear cells in a healthy human (9). It is thus not surprising that we detected low-level EBV transcription in many samples. This likely stems from infiltrating B cells that carry EBV, as we observed an association between the overall B-cell expression signature and the presence of EBV.
In addition to EBV, other herpesviruses are also ubiquitous in the human population, such as HCMV/HHV5. We observed evidence of HHV5 transcription in a number of samples. Yet, HHV5 was not associated with B-cell gene expression, consistent with HHV5 establishing latency in myeloid progenitors and replicating in endothelial cells, but not B cells (27,28).
In several tumor types, B-cell diversity was significantly increased even after first conditioning on B-cell content. This observation argues against selective retention or local expansion of a singular, EBV-infected clone. In a few tumor types, either EBV positivity alone or added as an interacting factor was associated with survival; however, we had limited confidence in these results due to the heterogeneity of how TCGA clinical data were collected and the heterogeneity of treatment of the cancer within a cohort. Nevertheless, it is easy and economical to test for EBV-specific transcripts in tumor biopsy specimens. Such a test may have utility as a predictive or prognostic biomarker. Examples for its use can be found in posttransplant lymphoproliferative disease, other types of EBV ϩ lymphoma (29)(30)(31), and NPC (32).
This study is an attempt to use systems biology to relate viral infections to immune responses as defined by transcript patterns, cell-type signatures, BCR diversity, and histopathology, such as inflammation and cancer immunotherapy. Immune signatures have been heralded as an assumption-free method of diagnosis driven by advances in machine learning, including many studies based on BCR diversity (33).The current data set was extremely heterogenous. This study could not decide whether EBV was first and inflammation and inflammatory cell (B, T, and NK) influx were second, or whether EBV-infected B cells were the result of tumor-associated inflammation, which attracted EBV-positive as well as EBV-negative B cells and led to EBV reactivation. Those insights are more likely to emerge in studies on specific tumors and populations: e.g., upper gastric carcinomas or response to immune modulatory therapy (34).
This study analyzed one of the most comprehensive mRNA-seq data sets to date to query the status of human viruses in cancer, both in terms of tumor diversity and total number of samples and reads. Prior studies had looked at the presence or absence of viruses in subsets of the TCGA (35)(36)(37)(38) or across the general population (39) and thus had less power to detect possible virus-tumor associations. We used a variant of digital subtraction on RNA-seq data (14), thus requiring that viral mRNAs or long noncoding RNAs (lncRNAs) were expressed. This data set does not capture microRNAs. Digital subtraction and variations thereof represent the most widely used approach to "hunt" for viruses in sequencing data (40,41) and have been credited with the detection of Merkel cell polyomavirus, among others (42). It has been validated for the detection of EBV in cell lines and other samples (39,43,44). We improved on the previous iterations of digital subtraction by first masking areas of low complexity and human homology. This masking step led to almost complete specificity. Hence, a less conservative threshold could be applied to identify EBV-positive samples than before. While this increased sensitivity is not necessarily supportive of establishing virus-tumor associations, it allows, for the first time exploration of an association between the presence of a virus (only EBV was found) and immune signatures of tumor-infiltrating cells.
As these were existing RNA-seq data, rather than biological specimens, we were not able to enrich for viral sequences by biochemical means prior to sequencing as in other studies using local patient cohorts (45)(46)(47). Thus, the sequencing depth of the original sample and the level of viral mRNAs limit this study's sensitivity. In the context of cancer biology, this could be considered a strength and measure of specificity, as one would expect human oncoviruses to transcribe copious amounts of at least one of their oncogenes or oncogenic RNAs.
Because of the biology of human tumor viruses, this approach differs from broad metagenomic surveys (48)-e.g., of environmental samples or of diseases caused by acutely replicating viruses. Metagenomics is the description of all genome sequences within a sample and based on the assumption that multiple species coexist and all have some bearing on the necessarily composite phenotype. In contrast, of all viral tumors described to date, only a single virus-in fact a single strain or clonal integration event-is found in high abundance in the tumor biopsy specimen, and that virus transcribes a defined set of viral oncogenes, may they be protein-coding or noncoding RNAs. (The exception here is cancers caused by retroviral integration, where the virus induced the aberrant expression of human, but not viral, oncogenes [49].) EBV may be the one human virus most useful to evaluate with regard to tumor immune status.

MATERIALS AND METHODS
Virus detection, VirDetect. VirDetect is available at https://github.com/dmarron/virdetect. RNA-seq reads were aligned to hg38 (without chrEBV) using STAR v2.4.2a (1,080 multi-maps, 10 mismatches). Unmapped reads were next aligned to a masked viral FASTA using STAR v2.4.2a (52 multi-maps, 5 mismatches). Vertebrate viral FASTA (1,894 viruses) was downloaded from GenBank. Viral FASTA was masked for increased specificity. Regions were masked in two ways. (i) Viral reads of length 75 were simulated from the entire viral FASTA and then mapped to hg38 using STAR v2.4.2a (1,080 multi-maps, 5 mismatches). If the virus simulated reads mapped to the human genome, they were masked in the viral FASTA. (ii) Areas of low complexity (9 or more repeating single nucleotides, 7 or more repeating double nucleotides, 4 or more repeating nucleotide patterns of 3, 3 or more repeating nucleotide patterns of 4, 2 or more repeating patterns of 5, or 2 or more repeating nucleotide patterns of 5) were masked. Viruses were then quantified using the resultant SAM file.
Computing gene expression modules. Gene expression modules are groups of highly correlated genes together in a "module" (50). Each module was computed by Z-scaling each of the genes and then taking the median of value of the scaled genes.
Assembly of B-cell receptors. B-cell receptor repertoires were assembled for the immunoglobulin heavy chain across all TCGA mRNA-seq samples using V'DJer as described by Mose et al. (24).
B-cell-type classification. BAGS classifier (GSE56315) was built using linear distance-weighted discrimination (dwdLinear from the R package Caret) of genes with a standard deviation of log 2transformed RNA-seq of Ͼ0.2. The classification subtype of each sample was called by the subclassification with the highest probability.
Statistical analysis. All statistical analyses and plots were generated using R. Kaplan-Meier plots and Cox proportional hazards regression models were implemented using the R package survival. The R packages used for analyses were stats, plyr, reshape2, doMC, and caret. The R packages used for generating plots are ggplot2 and survival.