Orphan Genes Shared by Pathogenic Genomes Are More Associated with Bacterial Pathogenicity

Recent pangenome analyses of numerous bacterial species have suggested that each genome of a single species may have a significant fraction of its gene content unique or shared by a very few genomes (i.e., ORFans). We selected nine bacterial genera, each containing at least five pathogenic and five nonpathogenic genomes, to compare their ORFans in relation to pathogenicity-related genes. Pathogens in these genera are known to cause a number of common and devastating human diseases such as pneumonia, diphtheria, melioidosis, and tuberculosis. Thus, they are worthy of in-depth systems microbiology investigations, including the comparative study of ORFans between pathogens and nonpathogens. We provide direct evidence to suggest that ORFans shared by more pathogens are more associated with pathogenicity-related genes and thus are more important targets for development of new diagnostic markers or therapeutic drugs for bacterial infectious diseases.

Earlier studies found that ORFans are shorter, have lower GC content, and evolve more rapidly (6–10). Therefore, ORFans were once thought to be mispredicted protein-coding genes. However, accumulating experimental evidence has demonstrated that many ORFans correspond to real and functional proteins (7, 11–24). In addition, it has been suggested that newly evolved ORFan genes often confer new traits and play significant roles in helping their host organisms adapt to ever-changing environments (5,9). For example, an ORFan gene named neaT was characterized in extraintestinal pathogenic (ExPEC) Escherichia coli and shown to have a key role in the virulence of ExPEC in zebrafish embryos (24). Therefore, although molecular biologists tend to focus more on conserved genes, the taxonomically restricted ORFans are likely to be more important for the emergence of species-specific traits: e.g., the ability of pathogens to infect their hosts.
Previously, ORFans have been shown to be enriched in genomic islands (GIs) of bacterial genomes (25). GIs are defined as horizontally transferred gene (HGT) clusters that often contain virulence factor (VF) genes and can transform nonpathogens to pathogens. Hence, many GIs are also known as pathogenicity islands (PAIs), a term we prefer to use in this article. In fact, PAIs were shown to contain more VF genes than the rest of the genome (26). Another study showed that 39% of ORFans in 119 prokaryotic genomes were found in clusters of genes with atypical base compositions (27), which correspond to horizontally transferred foreign elements from other bacteria or viruses. However, none of the previous large-scale analyses of prokaryotic ORFans (e.g., references 4, 28, 29, and 30) have distinguished pathogens and nonpathogens.
Recent pangenome analyses of numerous bacterial pathogens and their closely related nonpathogenic strains have suggested that each genome of a single species may have a significant fraction of unique gene content known as the variable genome (31)(32)(33)(34)(35)(36)(37)(38)(39)(40)(41). Many of the unique genes are lineage-specific ORFans; those unique genes residing in PAIs or prophages may have contributed to the bacterial pathogenicity (42,43).
In this study, our goal was to study the association between ORFans and pathogenicity of bacteria by analyzing fully sequenced bacterial genomes, which have been classified into pathogen (P) and nonpathogen (NP) groups. We identified ORFans adopting the pangenome idea, according to which proteins from the variable genome are ORFans. Compared to previous studies, the novelty of this study is that we have classified ORFans into different groups: SS-ORFans (strain-specific ORFans present in just one genome), PS-ORFans (pathogen-specific ORFans shared by pathogenic genomes), and NS-ORFans (nonpathogen-specific ORFans shared by nonpathogenic genomes).
Specifically, using bacterial genomes from nine bacterial genera, we aimed to address the following questions by comparing genomes of the same genus. (i) Do pathogens have more genes than nonpathogens? (ii) Do pathogens have a higher percentage of ORFans than nonpathogens? (iii) Do pathogens have more pathogenicity-related genes (PRGs), such as genes in prophages and PAIs and genes identified as HGTs and VFs, than nonpathogens? (iv) Which group of ORFans is more represented in the four types of PRGs and thus is more likely to be associated with bacterial pathogenicity?

RESULTS
Overall comparisons of ORFans between pathogens and nonpathogens in nine genera. The nine bacterial genera with more than five complete pathogenic genomes and five complete nonpathogenic genomes are shown in Table 1 (also see Materials and Methods). Here "complete" means that the genomes are fully sequenced and assembled. Bacteria of these genera are known to cause a number of common and devastating human diseases (see Table S1 in the supplemental material). As shown in Table 2, the 505 genomes are grouped into 340 pathogenic (P) genomes (1,255,580 proteins) and 165 nonpathogenic (NP) genomes (657,172 proteins). The percentages of ORFans are calculated relative to the gene contents in the two groups of genomes, respectively (see Fig. 1 and Materials and Methods for how we defined the four groups of ORFans). In the 340 P genomes, the percentage of SS-ORFans is 1.39% and the percentage of PS-ORFans is 4.48%. Similarly, in the 165 NP genomes, the percentage of SS-ORFans is 2.60% and the percentage of NS-ORFans is 6.00%. Hence, the overall percentage of ORFans seems higher in NP than P genomes, which agrees with a previous study (19% nonpathogen-associated genes versus 14% pathogen-associated genes) (26).
Furthermore, Table 2 also shows the four groups of ORFans further broken into the four types of PRGs (pathogenicity-related genes [explained in Materials and Methods]). For example, the percentage of SS-ORFans in P genomes carried by prophages is 12.24%, which was calculated as no. of SS-ORFans in prophages/total no. of SS-ORFans: 2,138/17,455. For prophages and PAIs, it is clear that ORFans of P genomes are more likely to be carried by PAIs and prophages than ORFans of NP genomes (e.g., for prophages, P genomes [18.75% + 12.24%] versus NP genomes [9.50% + 8.54%]). When looking at different ORFan groups, the percentage of PS-ORFans is always the highest (18.75% for prophages and 30.41% for PAIs). Additionally, it appears that ORFans are more likely to be carried by PAIs and prophages than non-ORFans in both P and NP genomes, which extends the finding made in reference 25.
For VFs, the numbers of ORFans annotated as VFs are very small, in contrast to much larger numbers for non-ORFans. Notably, 259 (0.66%) NS-ORFans are VFs, compared to 2,718 (4.84%) PS-ORFans being VFs. A previous study has shown that VFs are highly enriched in PAIs compared to non-PAI regions (26). Interestingly, here we showed that most VFs are found in non-ORFans (more conserved genes shared by P and NP genomes). This is likely because, as indicated in reference 26, there are VFs commonly found in P and NP genomes, which are more abundant in bacterial genomes than those pathogen-associated VFs.
For HGTs, non-ORFans were excluded in our HGT identification because they do not qualify, "having limited blastp hits in taxonomically close (genus-level) genomes" (see Materials and Methods). Table 2 shows that NP genomes have higher percentages of ORFans identified as HGTs than P genomes, contrary to the other three types of PRGs.
However, it should be noted that Table 2 combined ORFans of the nine genera as a whole for comparisons. Thus, the above observations could be biased due to the fact that some genera have more genomes (e.g., Streptococcus) or have better-annotated PRGs (e.g., Escherichia) than others. To obtain more statistically robust results without biases, we have counted the number of ORFans in each genome (see Data Set S1 in the supplemental material), calculated the percentages, and further statistically compared the P and NP genomes in each genus.
Pathogens do not always have more genes than nonpathogens. The pairwise nonparametric Wilcoxon test P values (the second column of Table 3) show that not all genera have their P genomes carrying more genes than NP genomes. In four of the nine genera (Bacillus, Escherichia, Pseudomonas, and Streptococcus), the P genomes have a higher number of genes than NP genomes. However, the opposite is true in three other genera: Clostridium, Corynebacterium, and Mycobacterium. This result remains the same even when plasmids are excluded from the analysis. This finding largely agrees with a previous study (44), which compared the number of genes in four genera (Bacillus, Escherichia, Pseudomonas, and Burkholderia) using a smaller data set.
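As an illustration of the kind of per-genus comparison summarized in Table 3, the following Python sketch computes a two-sided exact Wilcoxon rank-sum P value for two small groups of per-genome values (e.g., gene counts of P versus NP genomes in one genus). It assumes the rank-sum (Mann-Whitney) form of the Wilcoxon test, which the text does not specify; the function name is hypothetical, and the exhaustive enumeration is practical only for small groups.

```python
from itertools import combinations


def rank_sum_exact_pvalue(x, y):
    """Two-sided exact Wilcoxon rank-sum P value (hypothetical helper).

    Enumerates every way of assigning len(x) of the pooled ranks to group x,
    so it is only suitable for small sample sizes.
    """
    pooled = list(x) + list(y)
    n_x, n = len(x), len(pooled)

    # Assign midranks so that tied values share the average of their ranks.
    order = sorted(range(n), key=lambda i: pooled[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[order[j + 1]] == pooled[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1

    expected = n_x * (n + 1) / 2
    observed_dev = abs(sum(ranks[:n_x]) - expected)

    # Count assignments whose rank sum deviates at least as much as observed.
    total = extreme = 0
    for subset in combinations(range(n), n_x):
        total += 1
        if abs(sum(ranks[k] for k in subset) - expected) >= observed_dev - 1e-12:
            extreme += 1
    return extreme / total
```

For realistic group sizes, a normal approximation or an established statistics package would be the more practical choice.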

Pathogens do not always have more PRGs than nonpathogens.
In Table 3, we have also compared the percentage of PRGs between P and NP genomes in each genus. (Detailed counts are available in Data Set S1.) For prophage-carried genes, Table 3 shows that, although in Escherichia, pathogens tend to have more genes located in prophages than nonpathogens (44), in the other eight genera pathogens do not have more prophages than nonpathogens. For PAIs, in two genera (Burkholderia and Escherichia), the percentage of genes located in PAIs is higher in P genomes, while in two other genera (Clostridium and Pseudomonas), it is the opposite. Thus, it was inaccurate to conclude based on Table 2 that there is a higher percentage of prophages and PAIs in P genomes of all nine genera, because this is only true for Escherichia (Table 3), which dominated the prophage and PAI data.
For VFs, four genera (Corynebacterium, Listeria, Mycobacterium, and Pseudomonas) have a higher percentage of VF-carried genes in P than NP genomes. Lastly, for HGTs, four genera (Burkholderia, Clostridium, Corynebacterium, and Mycobacterium) have a lower percentage of ORFans derived from HGT in P than NP genomes. Therefore, the genus-by-genus statistical tests showed that pathogens do not always have more PRGs than nonpathogens, and the observations vary between different genera.
The percentage of PS-ORFans is always higher than that of SS-ORFans in pathogens, which is not true in nonpathogens. When taking the P and NP genomes of the nine genera as a whole for comparison, a sequence of percentages was observed in Table 2: % NS-ORFans (NP) > % PS-ORFans (P) > % SS-ORFans (NP) > % SS-ORFans (P). For more accurate comparisons without bias from combining different genera, we have performed genus-by-genus statistical tests, and for each genus, four comparisons with the four groups of ORFans have been made (see Fig. 2 legend).
Wilcoxon nonparametric test P values for these comparisons can be found in Table S2 in the supplemental material. The detailed counts of different ORFans are available in Data Set S1.
For the comparison SS-ORFans (P) versus SS-ORFans (NP), only in Escherichia was the percentage of SS-ORFans (P) significantly higher than the percentage of SS-ORFans (NP); in six genera (Burkholderia, Corynebacterium, Listeria, Mycobacterium, Pseudomonas, and Streptococcus), it is the opposite.
For the comparison PS-ORFans (P) versus NS-ORFans (NP), in three genera (Escherichia, Burkholderia, and Streptococcus), the percentage of PS-ORFans is significantly higher than the percentage of NS-ORFans; however, in three other genera (Bacillus, Corynebacterium, and Pseudomonas), it is the opposite. All of these findings suggest that nonpathogens do not necessarily have more ORFans than pathogens, because different genera behave differently.
For the comparison PS-ORFans (P) versus SS-ORFans (P), in the nine genera, the percentage of PS-ORFans is always significantly higher than the percentage of SS-ORFans. This suggests that ORFans tend to be shared by different pathogenic genomes.
However, for the comparison NS-ORFans versus SS-ORFans (NP), in four genera (Bacillus, Clostridium, Corynebacterium, and Pseudomonas), the percentage of NS-ORFans is significantly higher than the percentage of SS-ORFans, while in Escherichia, the percentage of NS-ORFans is significantly lower than the percentage of SS-ORFans, and in the other four genera, there is no significant difference. Therefore, unlike P genomes, NS-ORFans are not always more abundant than SS-ORFans in NP genomes.
PS-ORFans are always more abundant than SS-ORFans in PRGs in pathogens, which is not true in nonpathogens. We continued by comparing the percentages of different ORFan groups in the four types of PRGs (prophages in Table 4, PAIs in Table 5, VFs in Table 6, and HGTs in Table 7), which is a novel analysis of this study. For prophages, PAIs, and VFs, we first compiled a list of proteins encoded by these PRGs in each genome, and then we separated PRGs into SS-ORFans, PS-ORFans, and non-ORFans in pathogenic (P) genomes and into SS-ORFans, NS-ORFans, and non-ORFans in nonpathogenic (NP) genomes. Lastly, we calculated their percentages for Wilcoxon tests. For HGTs, non-ORFans were excluded in the Wilcoxon tests of Table 7. The detailed counts of different ORFans in different PRGs are available in Data Set S1. The most interesting observation from Tables 4 to 7 is that the percentage of PS-ORFans is significantly higher than the percentage of SS-ORFans in P genomes of almost all the genera for all four types of PRGs. (Listeria in Table 6 has a P value of 0.5, because only 1 out of the 40 Listeria genomes has VFs, and thus, the P value is not meaningful.) This also agrees with the finding made in Fig. 2 and Table S2 that in P genomes of the nine genera, the percentage of PS-ORFans is always higher than the percentage of SS-ORFans.
This finding suggests that PS-ORFans (shared by multiple P genomes) are more associated with bacterial pathogenicity than SS-ORFans (unique in each genome). In contrast, in NP genomes, the comparison of the percentages of NS-ORFans and SS-ORFans for the four types of PRGs does not show such uniformity. Particularly, for prophages and PAIs (Tables 4 and 5), most of the genera show no significant difference.
To study what functions are overrepresented in ORFans, we have compared the GO annotation of our four ORFan data sets against that of a protein data set randomly selected from the entire gene content of the nine genera. A binomial test was run on each GO term to test if the ORFan count is significantly higher than the random protein count. Data Set S2 in the supplemental material provides the top-ranked GO terms that are significantly overrepresented in the four groups of ORFans. As expected, GO terms related to phages (such as DNA integration, virus tail fiber assembly, and viral genome ejection) are among the most overrepresented functions found in PS-ORFans. Interestingly, DNA integration is also in the top 10 GO terms found in the other three ORFan groups. In addition, two GO terms (DNA excision [related to DNA repair after recombination] and response to nutrient [related to extracellular stimulus]) are found in the top 10 terms for three of the four ORFan groups.
A database of ORFans of pathogenic bacteria. All the ORFan data generated in this study are provided through an online database, ORFanDB (http://cys.bios.niu.edu/ORFanDB/). The website features an embedded interactive web application that allows a user to select a species and then further narrow the selection based on strain and ORFan type using a set of nested tabs. The final nested tab ("Protein Information") reveals data about the ORFan, such as hits in PRGs, a JBrowse instance showing the genomic neighborhood, and genome metadata curated from JGI (Joint Genome Institute).
There is also a download page from which the user can download all the data available, genus-specific data, or ORFan type-specific data. Lastly, a help page and an about page are created to provide the user with information on how to use the application.

DISCUSSION
Previous literature has studied the four types of pathogenicity-related genes (PRGs) using comparative genomics approaches (25–27, 44). Two papers have specifically compared prophages (44) and VFs (26) between pathogens and nonpathogens. In addition, we and others have focused on developing new computational methods for the identification of ORFans in hundreds of bacterial genomes and metagenomes (2–4, 6). Despite these previous efforts, the novelty of the current work is that we have separated ORFans into four different groups, which enabled us to compare them within/between pathogens and nonpathogens of the same bacterial genus, particularly in terms of their relative abundance in the four types of PRGs.
Before this study, the previous literature had already suggested that (i) at least in some genera, P genomes are larger than NP genomes (44), (ii) ORFans are overrepresented in PAIs compared to the rest of the genome (25), and (iii) combining genomes from different genera, overall, P genomes have fewer ORFans than NP genomes (26).
Our data extended these findings. For example, for finding i, Table 3 showed that in four out of nine genera, P genomes have more genes than NP genomes, whereas in the other five genera, this is not true. For finding ii, the previous finding was extended with four groups of ORFans in Table 2, which showed the following for genes located in PAIs: % PS-ORFans (P) > % SS-ORFans (NP) ≈ % SS-ORFans (P) > % NS-ORFans (NP) >> % non-ORFans (P) > % non-ORFans (NP). This finding was also extended to prophages, showing the following: % PS-ORFans (P) > % SS-ORFans (P) > % NS-ORFans (NP) > % SS-ORFans (NP) >> % non-ORFans (P) > % non-ORFans (NP).
For finding iii, Table 2 confirmed that NP genomes have a higher overall percentage of ORFans than P genomes, but also showed that the percentage of SS-ORFans (NP) is higher than the percentage of SS-ORFans (P), and the percentage of NS-ORFans (NP) is higher than the percentage of PS-ORFans (P). However, we argued that an unbiased genus-by-genus comparison was required to obtain a more accurate result. When comparing them in each genus (Fig. 2 and Table S2), the percentages of NS-ORFans (NP) and SS-ORFans (NP) were no longer always higher than those of PS-ORFans (P) and SS-ORFans (P), respectively. For example, in Escherichia, the percentage of PS-ORFans (P) was significantly higher than that of NS-ORFans (NP) and the percentage of SS-ORFans (P) was significantly higher than that of SS-ORFans (NP).
The most significant findings of this study are that in pathogens of the nine genera, the percentage of PS-ORFans was consistently higher than that of SS-ORFans (Fig. 2 and Table S2), and the percentage of PS-ORFans annotated as PRGs (all four types) was also consistently higher than that of SS-ORFans (Tables 4 to 7). These findings are even more intriguing given that in nonpathogens no such strong and uniform pattern (i.e., % NS-ORFans > % SS-ORFans) was observed across the nine genera.
To add even more support for these findings, we ran "all versus all" blastp searches on the 56,196 PS-ORFan and 39,437 NS-ORFan data sets (Table 2) separately. Then we counted how many genera each query ORFan had hits in. In total, 2,437 (4.34%) PS-ORFans and 2,088 (5.29%) NS-ORFans also have blastp hits in genera other than their own. After grouping ORFans based on the number of genera (ORFan conservation), we plotted the percentages of each group matching prophages and PAIs and observed a positive correlation for PS-ORFans but not for NS-ORFans (Fig. 3). We also did the same for VFs and HGTs (see Table S4 in the supplemental material). VFs showed a similar pattern, but the numbers were too small to be significant. HGTs had positive correlations in both PS-ORFans and NS-ORFans. Overall, this further suggests that the more conserved PS-ORFans are (i.e., found in more genera), the more likely they are to be pathogenicity related. In contrast, this is not true for NS-ORFans, at least in prophages and PAIs. From the evolutionary selection perspective, new genes from phages, distant bacteria, PAIs, and other mobile genetic elements can constantly enter the host genome through horizontal gene transfer; however, these new genes have to go through the natural selection process, where only those providing a selective advantage to their bacterial hosts (i.e., pathogenicity) are eventually fixed in the pathogen population (e.g., found in multiple pathogenic genomes of the same genus).
It should be mentioned that such an HGT selection model works for any genes and any biological processes in any genomes. Notably, in nonpathogens, we also observed a significant percentage of ORFans and PRGs (Table 2). However, the selection of PRGs and ORFans in nonpathogens may not be as strong and universal as in pathogens.
These findings strongly suggest that PS-ORFans, which are shared by multiple pathogens, are more likely than SS-ORFans to transform a nonpathogen into a pathogen. Therefore, PS-ORFans should be considered better targets for identifying novel PRGs and for developing diagnostic markers or therapeutic drugs.
Lastly, other than ORFans that originated through horizontal gene transfer (gene gain) from phages or other bacteria, there are other important factors that can also account for bacterial pathogenicity, such as gene loss due to genome reduction (i.e., smaller P genomes), modification of the core genome (non-ORFans) with single nucleotide polymorphisms (SNPs), indels, and recombinations (42,43,45). Although not a focus of this study, some of these factors such as SNPs found in PRGs of non-ORFans may be a more plausible reason for infectious disease outbreaks, which usually happen in a relatively short evolutionary time scale, as revealed by the numerous recent whole-genome shotgun sequencing efforts for genomic epidemiology studies (e.g., reviewed in references 46, 47, and 48).

MATERIALS AND METHODS
Genome data. In total, 6,005 completely sequenced and assembled bacterial genomes were downloaded from the RefSeq database (ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria) as of August 2017, denoted as Bacteria-DB.
A list of bacterial genomes at http://www.pathogenomics.sfu.ca/pathogen-associated/2014/ was manually curated and classified into pathogen (P) and nonpathogen (NP) groups by the Brinkman lab (26). As this list was from an older version of the RefSeq database, there were a smaller number of genomes curated and available in the above web link than the Bacteria-DB we used. The 2,864 GenBank accession numbers (ACs) of these genomes were used to extract their RefSeq data files (genomic fna, protein faa, etc.) from the Bacteria-DB. Out of the 2,864 ACs, 2,479 were found in Bacteria-DB. Nine genera with >5 pathogenic and >5 nonpathogenic genomes (in total, 505 genomes) were kept for further analyses.
ORFan identification. As shown in Fig. 1, for each bacterial genus, we used all of its genomes (P and NP) to make a combined proteome (all proteins of a genome). We then ran an "all versus all" blastp search (E value of <0.01) using DIAMOND (49), and based on the search result, we classified proteins of each genome into the following:
1. SS-ORFans: strain-specific ORFans, defined as proteins with DIAMOND hits restricted to the query genome (two groups of SS-ORFans: those from P and those from NP)
2. PS-ORFans (only in P): pathogen-specific ORFans, defined as proteins with DIAMOND hits restricted to ≥2 pathogenic genomes
3. NS-ORFans (only in NP): non-pathogen-specific ORFans, defined as proteins with DIAMOND hits restricted to ≥2 non-pathogenic genomes
4. Non-ORFans: defined as the rest of proteins in the genomes
PRGs. Four types of genes were identified in the 505 genomes: prophage genes, PAI genes, VF genes, and HGT genes.
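The four-way ORFan classification defined above can be sketched in Python as follows. This is a minimal illustration rather than the pipeline used in the study: the function name, the input layout (query/subject pairs from a within-genus all-versus-all search, plus two lookup dictionaries), and the label strings are all assumptions made for the example.

```python
from collections import defaultdict


def classify_proteins(hits, genome_of, is_pathogen):
    """Label each protein as SS-ORFan, PS-ORFan, NS-ORFan, or non-ORFan.

    hits        -- iterable of (query_protein, subject_protein) pairs that
                   passed the E-value cutoff (hypothetical input layout)
    genome_of   -- dict mapping protein ID -> genome ID
    is_pathogen -- dict mapping genome ID -> True (P) or False (NP)
    """
    # Collect the set of genomes in which each query protein has hits.
    hit_genomes = defaultdict(set)
    for query, subject in hits:
        hit_genomes[query].add(genome_of[subject])

    labels = {}
    for protein, genome in genome_of.items():
        # A protein always matches itself, so its own genome is included.
        genomes = hit_genomes.get(protein, set()) | {genome}
        if genomes == {genome}:
            labels[protein] = "SS-ORFan"   # hits restricted to the query genome
        elif all(is_pathogen[g] for g in genomes):
            labels[protein] = "PS-ORFan"   # hits only in >=2 pathogenic genomes
        elif not any(is_pathogen[g] for g in genomes):
            labels[protein] = "NS-ORFan"   # hits only in >=2 nonpathogenic genomes
        else:
            labels[protein] = "non-ORFan"  # shared by P and NP genomes
    return labels
```

Whether an SS-ORFan belongs to the P or NP group then follows from `is_pathogen[genome_of[protein]]`.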
The genomic locations of ORFans were compared to the genomic locations of prophages in the PHASTER database (50) and to the genomic locations of PAIs in the IslandViewer database (51). The ORFan genes in prophages and PAIs were then classified into SS-ORFans, PS-ORFans, and NS-ORFans groups.
To determine if an ORFan is a virulence factor (VF) gene, ORFan sequences were blastp searched against the VFDB (52) using DIAMOND (E value of <1e-5).
Horizontally transferred (HGT) genes were identified as proteins having limited blastp hits in taxonomically close (genus-level) genomes but more hits in taxonomically distant (order-level) genomes. To determine if an ORFan is horizontally transferred, ORFan sequences were blastp searched against the protein sequences of the Bacteria-DB (6,005 genomes of various taxonomic phyla) using DIAMOND (E value of <1e-5). We defined an ORFan to be horizontally transferred if it has very few blastp homologs within the studied genus, but has blastp homologs in other taxonomic orders. Specifically, the DIAMOND result was filtered to remove all hits of the same genus as the ORFan query. Then the taxonomic lineages of the remaining hits were examined. If all of the remaining hits of an ORFan are from different taxonomic orders (two levels up from genus in the taxonomy hierarchy), the ORFan has no blastp hits in genomes of the same order outside its own genus, but has hits in genomes of more distant orders. This is evidence of gene transfer from distant organisms, and such ORFans were retrieved as HGTs.
For example, a PS-ORFan protein, WP_001086421.1, from Escherichia coli APEC O1 (GCF_000014845) has a small number of blastp hits within the Escherichia genus (all hits are from pathogenic genomes) and no other hits within the Enterobacterales order. However, it has numerous hits in other orders of the Gammaproteobacteria class and orders of other bacterial phyla. Such atypical taxonomic distribution of WP_001086421.1's blastp hits can be explained either by HGT from distant organisms into pathogens of the Escherichia genus or by massive gene loss within the Enterobacterales order. As the Enterobacterales order is one of the most sequenced bacterial orders (thousands of genomes in Bacteria-DB), the chance of massive and independent gene loss is much smaller than the chance of recent HGT. This is true for all the genomes of the nine genera, for they are all from well-represented orders in the genome database.
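The HGT filter described above can be expressed as a short Python sketch. It reflects one reading of the rule (drop same-genus hits, then require every remaining hit to come from an order other than the query's own); the function name and the (genus, order) input layout are hypothetical, and treating an ORFan with no remaining hits as "not HGT" is an assumption.

```python
def is_hgt(query_genus, query_order, hit_lineages):
    """Return True if an ORFan's hits suggest horizontal gene transfer.

    hit_lineages -- list of (genus, order) tuples, one per database hit
                    (hypothetical input layout)
    """
    # Step 1: remove all hits from the same genus as the ORFan query.
    remaining = [(genus, order) for genus, order in hit_lineages
                 if genus != query_genus]

    # Assumption: with no hits outside the genus there is no evidence of HGT.
    if not remaining:
        return False

    # Step 2: every remaining hit must come from a different taxonomic order.
    return all(order != query_order for _, order in remaining)
```

Under this reading, a single hit in another genus of the same order (e.g., another Enterobacterales genus for an Escherichia query) is enough to reject the HGT call.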
Functional annotation of ORFans. We modified a workflow reported in reference 3 to annotate ORFans for Gene Ontology functional descriptions. DIAMOND was used to compare all the ORFans to the UniProt database. The best hit of each ORFan was kept if the alignment identity was ≥80% and the E value was ≤0.01. The GO terms of the UniProt hits were then assigned to the ORFans by parsing the UniProt ID mapping file downloaded from the UniProt ftp site. In total, 39,330 ORFans were annotated with GO using UniProt2GO.
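The best-hit filtering step can be illustrated with the following Python sketch. The source does not state how the "best" hit was chosen; here it is assumed to be the hit with the highest bit score, and the function name and row layout are hypothetical.

```python
def best_hits(rows, min_identity=80.0, max_evalue=0.01):
    """Keep each query's best hit if it passes the identity and E-value cutoffs.

    rows -- iterable of (query, subject, percent_identity, evalue, bitscore)
            tuples (hypothetical layout of parsed search results)
    """
    best = {}
    for query, subject, identity, evalue, bitscore in rows:
        # Assumption: the best hit is the one with the highest bit score.
        if query not in best or bitscore > best[query][3]:
            best[query] = (subject, identity, evalue, bitscore)

    # Apply the thresholds to the best hit only, as described in the text.
    return {
        query: hit[0]
        for query, hit in best.items()
        if hit[1] >= min_identity and hit[2] <= max_evalue
    }
```

The returned query-to-subject mapping would then be joined against an ID-to-GO mapping to transfer GO terms.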
ORFans that were not annotated by UniProt2GO were then compared to the PDB70 database using the more sensitive profile-based tool hhsearch (53). The results were parsed to keep the best hit if the probability threshold was ≥80% and the E value was ≤1. The GO terms of the PDB hits were then assigned to the ORFans by parsing the PDB2GO mapping file downloaded from the GOA (GO annotation) ftp site. In total, 13,053 ORFans were annotated with GO using PDB2GO. Altogether, 52,383 ORFans were mapped to GO terms.
For GO enrichment analysis, 100,000 proteins were randomly selected from the nine genera and subjected to the same workflow to be mapped to GO terms. The R function binom.test was used to compare the number of ORFans with a specific GO term (limited to the 5th level of GO terms from BP [biological process] and MF [molecular function] categories) to the number of random genes with the same GO term. The R function p.adjust was used to adjust for multiple comparisons.
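A per-term enrichment test of this kind might look like the following Python sketch, which uses only the standard library in place of R. Several details are assumptions because the source does not give them: the test is taken to be one-sided ("greater"), the null probability is taken to be the fraction of random proteins carrying the GO term, and the multiple-testing correction is taken to be Holm (the default method of R's p.adjust). Function names are hypothetical.

```python
from math import exp, lgamma, log


def binom_greater_pvalue(k, n, p):
    """One-sided binomial P value, P(X >= k) for X ~ Binomial(n, p), 0 < p < 1.

    Terms are computed in log space so that large n does not overflow.
    """
    if k <= 0:
        return 1.0
    log_p, log_q = log(p), log(1.0 - p)
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        total += exp(log_pmf)
    return min(1.0, total)


def holm_adjust(pvalues):
    """Holm step-down adjustment (assumed correction method)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the rank-th smallest P value by (m - rank), cap at 1,
        # and enforce monotonicity with a running maximum.
        value = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, value)
        adjusted[idx] = running_max
    return adjusted
```

For one GO term, `k` would be the number of ORFans annotated with the term, `n` the number of GO-annotated ORFans in the group, and `p` the fraction of random proteins with that term; the P values for all tested terms are then passed together to `holm_adjust`.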
Data availability. The data from this study were organized into a MySQL database. A web application was written in R, using primarily the Shiny package, to provide a user interface to explore these data. Shiny Server was used to host the publicly available website, ORFanDB, in which all of the ORFan data have been made available (http://cys.bios.niu.edu/ORFanDB/).