Listeriomics: an Interactive Web Platform for Systems Biology of Listeria

In the last decades, Listeria has become a key model organism for the study of host-pathogen interactions, noncoding RNA regulation, and bacterial adaptation to stress. To study these mechanisms, several genomics, transcriptomics, and proteomics data sets have been produced. We have developed Listeriomics, an interactive web platform to browse and correlate these heterogeneous sources of information. Our website will allow listeriologists and microbiologists to decipher key regulatory mechanisms by using a systems biology approach.


RESULTS
The Listeriomics web interface. Genomic, transcriptomic, or proteomic data can be browsed by using the Listeriomics website (http://listeriomics.pasteur.fr/) main page (Fig. 1; Table 1; see Fig. S1 in the supplemental material). For each type of data, we designed a summary panel to navigate through the different data sets. The top banner of the website gives direct access to them. As summarized in Table 1, users can search 83 complete Listeria genomes and browse 492 transcriptome and 74 proteome data sets. Listeriomics integrates four tools for omics data management, i.e., (i) a genome viewer for displaying gene expression array, tiling array, and sequencing data along with proteomics and genomics data; (ii) an expression atlas and protein atlas that connect every genomic element (genes, small RNAs, asRNAs) to the most relevant omics data; (iii) a protein conservation tool for the direct visualization of the presence or absence of a protein in a specific Listeria strain; and (iv) a coexpression network analysis tool for the visualization of genome features with the same expression profile.
The genomic interface is designed to browse every complete genome of the Listeriomics resource. Users can access strain name, serotype, lineage, and isolation information, along with a complete phylogenomic tree of Listeria strains (Fig. 1). From this table, scientists can access all the annotated genes of a specific strain. For each Listeria gene, five different information panels are available. The first panel shows general information about the gene, including its position and its predicted annotated function. DNA and amino acid sequences can be accessed and saved as FASTA files or sent directly for a BLASTn or BLASTp search (53). The predicted subcellular localization (cytoplasm, cytoplasmic membrane, cell wall, cell surface, and extracellular milieu [27]) of each protein is also displayed, along with information about the secretion pathway possibly used by the protein. The second panel provides an instant view of the conservation of a specific protein in other Listeria strains. This panel dynamically displays homologs on the Listeria reference tree in each existing Listeria strain. It also displays a summary table of all the homologous proteins with their similarity percentages and amino acid sequences. Users can also create a multialignment file of the homologous proteins. With the third panel, the user can visualize the protein locus synteny in all Listeria strains. We built an external synteny website by using the SynTView architecture (54). The fourth panel uses the expression atlas to show in which transcriptomics data sets the selected gene is differently expressed. The fifth panel displays every proteomics data set in which the protein encoded by the selected gene has been detected. Finally, from the home webpage, a summary panel with all the small RNAs in L. monocytogenes EGD-e can be accessed (Fig. 1).
For each noncoding RNA element, one can display its position, its nucleotide sequence, its predicted secondary structure at 37°C, and a table displaying all supplementary information provided in source references (22, 55-58).
In the transcriptomic interface of the Listeriomics website, researchers can access all the Listeria transcriptomic data sets published so far. A searchable table shows every data set available with a precise description of the biological conditions studied. In total, four different transcriptome technologies (gene expression array, tiling array, RNA-Seq, and TSS) are included for seven different L. monocytogenes strains grown in four different broth media and under seven intracellular conditions (Fig. 2A). Once transcriptomic data sets have been selected, it is possible to obtain the number of genes and small RNAs that are differently expressed between the selected data sets and their corresponding reference biological conditions. Users can then directly visualize the relative gene expression values, shown as log fold changes, in a heat map representation. Also, specific lists of elements extracted from key publications can be chosen, such as a list of internalin genes (17) or surface proteins with the LPXTG motif (59), to display their expression under specific biological conditions.
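As a rough illustration of how such a count could be obtained, the sketch below scales one data set of log fold changes to unit standard deviation (the normalization described for the expression atlas in Materials and Methods) and counts the elements beyond a cutoff. The function names, the toy values, the use of the population standard deviation, and the choice of an absolute-value cutoff of 1.5 are assumptions made for this example; it is not the code of the Listeriomics platform.

```python
import statistics


def normalize_log_fold_changes(values):
    """Divide every log fold change by the standard deviation of the data set.

    The data set is assumed to be centered on zero already, so no mean is
    subtracted (assumption taken from the description of the expression atlas).
    """
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("data set has zero variance")
    return [v / sigma for v in values]


def count_differently_expressed(log_fc_by_element, cutoff=1.5):
    """Count elements whose normalized |log fold change| exceeds the cutoff."""
    scaled = normalize_log_fold_changes(list(log_fc_by_element.values()))
    return sum(1 for z in scaled if abs(z) > cutoff)


# Toy data set: gene names are real, values are invented for illustration.
example = {
    "actA": 4.0, "hly": 3.0, "lmo0001": 0.1,
    "lmo0002": -0.2, "lmo0003": 0.0, "lmo0004": -0.1,
}
print(count_differently_expressed(example))
```

Because every data set is brought to the same scale first, one cutoff can be reused across data sets produced with different technologies.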
The coexpression network interface is designed to display possible correlations between genes and small RNAs in accordance with the "guilt by association" paradigm (60). This paradigm states that two genomic features that share the same expression profile might be involved in the same functional process. Pearson correlation coefficients have been calculated for 42 tiling array and RNA-Seq data sets. Once a specific Pearson correlation coefficient cutoff has been selected, this interface filters the coexpression network to show only the selected genome elements and the genome elements having a correlation coefficient above the selected cutoff. Two types of displays are available for the coexpression network, a standard force-directed graph visualization (Fig. 3B) and a circular graph visualization (Fig. 3C). The latter viewer shows the Pearson correlation coefficient between genome elements by using a representation of the circular bacterial genome. This visualization highlights possible coexpression between distant genomic loci.
The Listeriomics database contains 74 proteomics data sets (Fig. 2D). All these data can be accessed through a summary panel with a search interface. Users can visualize the selected data sets on a heat map showing the presence or absence of each Listeria protein under a specific biological condition. In contrast to transcriptomics data sets, relative protein expression values are not available.

A web form is available for requesting the integration of a specific data set into the Listeriomics database. Before new genomes, transcriptomes, or proteomes are published on the Listeriomics website, the data sets must first be uploaded to reference repositories such as the Sequence Read Archive (61) or ArrayExpress (21). This process ensures that all the data sets are formatted in accordance with international standards and that minimal information on the experimental design of each study has been provided.
The multi-omics genome viewer. One of the key features of the Listeriomics interface is the multi-omics genome viewer. Figure 3A shows a variety of omics data sets produced for L. monocytogenes EGD-e grown at 37°C to the exponential phase in brain heart infusion (BHI) medium. RNA-Seq (62), tiling array (56), gene expression array (56), TSS (22), transcription termination site and ribosome profiling (63), and proteomics (36) data sets are displayed. To our knowledge, this is the first time that such a variety of omics scales can be browsed together through a genome viewer for a prokaryotic organism. Thus, the user can visualize the correlation of the genome annotation with transcription and translation for a specific coding RNA or sRNA. The genome viewer is dynamic, with zoom-in and zoom-out capabilities. The viewer also has search capability to access a specific position in the selected Listeria strain genome. Every omics data set present in the Listeriomics database can be added to this genome viewer.
Easy-access buttons on the home page of the Listeriomics interface allow quick access to three genome viewers with preloaded reference omics data sets. The first genome viewer is for L. monocytogenes EGD-e grown to the exponential phase at 37°C in BHI (Fig. 3A), the second is for the same condition but in stationary phase, and the last is for L. monocytogenes EGD-e grown in mouse macrophage cells.
Meta-analysis of Listeria transcriptomic data sets. A meta-analysis of the diversity of transcriptomic data sets available in the Listeriomics database offers a striking overview of the variety of studies performed by the Listeria research community. This wide range of studies covers a majority of the different living environments in which Listeria species have been observed to grow. As shown in Fig. 4A, several studies on the effects of different growth environments on Listeria have been performed, i.e., cold environments, acidic environments, specific gene deletions, common biocides, sugar availability, and intracellular growth. This variety of biological conditions can be browsed easily on the Listeriomics website. Notably, the most frequently used growth condition is bacterial growth to the exponential phase at 37°C in BHI. This biological condition is used as a reference condition in most of the studies.
We extracted the list of genes of L. monocytogenes EGD-e found to be differently expressed in the highest number of data sets (Fig. 4B; see Table S6). Of the top 15 genes on this list, 8 are well-studied virulence genes (actA, hly, plcA, plcB, mpl, inlA, inlB, uhpT), all regulated by the PrfA protein. We identified actA as the most variable gene. This gene is differently expressed in 102 of the 279 available data sets for this Listeria strain. Among several other functions (64-66), the ActA protein is responsible for actin nucleation and intracellular propulsion. The other genes on the list are involved in either sugar metabolism (lmo0096, lmo2391, lmo0783, lmo2684) or stress response (lmo2158, lmo2673). Finally, a not-yet-described membrane protein, Lmo2484, is differently expressed in 81 of the 279 available data sets. This protein is conserved with >90% similarity in all 83 Listeria strains present in the Listeriomics database. We also found six genes that have never been demonstrated to be differently expressed in a data set (Fig. 4C; see Table S6). As expected, these genes are involved in general bacterial physiology, such as DNA mismatch repair, or encode reductase and transferase enzymes. Finally, we extracted the 651 genes of L. monocytogenes EGD-e that were found to be differently expressed in >10% of the 279 data sets (see Table S6). We performed a pathway enrichment analysis by using COG (Clusters of Orthologous Groups) information and the Fisher exact test (67) (Fig. 4D). We found that the cluster of genes differently expressed in the highest number of data sets was associated with the carbohydrate transport and metabolism functions (P = 4.15e-14). This may be linked to the fact that, to survive in a wide variety of environments, Listeria bacteria have to be able to switch from one carbon source to another. The second most represented cluster is that of cell motility genes (P = 2.18e-4). The third (P = 1.3e-3) and fourth (P = 2.1e-3) most represented clusters include genes without any specific function, most of them being annotated as encoding hypothetical proteins. The latter result highlights the fact that many of the important genes of Listeria species gene regulatory networks remain to be described. The Listeriomics resource is an essential tool for their investigation.
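The enrichment test used in this analysis can be sketched as a one-sided Fisher exact test, which for over-representation reduces to an upper-tail hypergeometric probability. The function name and argument names below are assumptions for illustration, and the per-category counts in the usage lines are hypothetical; only the list size (651) and the number of EGD-e proteins (2,859) come from the text, and the latter is used here as a stand-in for the size of the gene background.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher exact P value for over-representation.

    k: genes in the selected list that belong to the category
    n: size of the selected list
    K: genes in the background that belong to the category
    N: size of the background
    Returns P(X >= k) for X following a hypergeometric distribution.
    """
    if not (0 <= k <= n <= N and 0 <= K <= N):
        raise ValueError("inconsistent counts")
    total = comb(N, n)
    upper = min(n, K)
    # math.comb returns 0 when the second argument exceeds the first,
    # so impossible tables contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical category: 200 background genes, 90 of them in the list.
p_value = fisher_enrichment_p(k=90, n=651, K=200, N=2859)
print(f"P = {p_value:.3g}")
```

Exact integer arithmetic is used throughout, so very small P values are not distorted by floating-point underflow in intermediate terms.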
Meta-analysis of Listeria proteomic data sets. An analysis of the variety of proteomic data sets available in the Listeriomics database shows that many reference biological conditions have been studied. As for transcriptomic data, a large number of data sets were produced from bacteria grown to the exponential phase at 37°C in BHI. Users can also access data sets from intracellular growth, cold environments, and the stationary growth phase. The key information in these proteomic data sets is the protein extraction protocol used, which indicates which compartment of the bacterial cells (cytoplasm, membrane, cell wall, or secretome) has been analyzed. The cytoplasmic compartment is the most studied. The second most studied is the secretome compartment, the focus of many publications, since secreted proteins are key components of the Listeria intracellular life cycle. The ability to visualize these proteomic data sets together at the Listeriomics website will help the Listeria community to further investigate these groups of proteins and their roles in the different compartments of bacterial cells.
We investigated the 58 proteomic data sets available for L. monocytogenes EGD-e. We counted the data sets in which each protein was analyzed (see Table S6). As expected, we found that most of the proteins analyzed in half of the data sets come from the translation machinery or the carbohydrate transport and metabolism pathway. Remarkably, the first known virulence factor on the list is the pore-forming toxin LLO (Lmo0202), which is detected in 22 of the 58 proteomic data sets present in the Listeriomics database. We also found that >80% (2,344/2,859) of the L. monocytogenes EGD-e proteins have been observed in one or more proteomic data sets (see Table S6). Remarkably, 512 proteins were never shown to be produced.
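The counting step behind these figures can be sketched as follows. The function names, the dictionary layout (data set name mapped to a list of detected proteins), and the toy identifiers are assumptions for illustration; the actual tables are those referenced in Table S6.

```python
from collections import Counter


def detection_counts(detected_by_dataset):
    """Count, for each protein, the number of data sets in which it appears."""
    counts = Counter()
    for proteins in detected_by_dataset.values():
        counts.update(set(proteins))  # set() ignores duplicates within a data set
    return counts


def never_detected(all_proteins, counts):
    """Return annotated proteins absent from every data set, sorted by name."""
    return sorted(p for p in all_proteins if counts[p] == 0)


# Toy input: three hypothetical data sets and three annotated proteins.
toy = {
    "dataset_1": ["Lmo0202", "Lmo0001"],
    "dataset_2": ["Lmo0202"],
    "dataset_3": ["Lmo0001", "Lmo0001"],
}
toy_counts = detection_counts(toy)
print(toy_counts.most_common())
print(never_detected(["Lmo0001", "Lmo0202", "Lmo9999"], toy_counts))
```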
Systems-level analysis of the L. monocytogenes EGD-e virulence locus. We investigated the coexpression network of the L. monocytogenes EGD-e virulence locus (lmo0200 to lmo0206), linking every gene with a Pearson correlation coefficient of >0.85 (Fig. 3B and C). As part of the PrfA core regulon (68), we found uhpT (lmo0838) and inlC (lmo1786) coexpressed with the virulence locus. Strikingly, the prfA gene (lmo0200) did not appear to be strictly coexpressed with the other genes of the virulence locus. Neither the inlA (lmo0433) nor the inlB (lmo0444) gene, both of which are extensively studied virulence factors regulated by the PrfA protein, was found in the coexpression network. The two noncoding RNA elements rli51 and rli74 in the virulence locus were also found to be coexpressed. In addition, we found another gene, lmo0752, coexpressed with the virulence locus that has not been described in the PrfA core regulon (68). Nevertheless, this gene was previously described (69) as part of the bile tolerance locus (lmo0745 to lmo0755).

DISCUSSION
The availability of curated information on genes, proteins, and cellular processes is essential not only for providing a better understanding of the Listeria genus but also as a key element in the development of our understanding of biological processes with respect to food industry and clinical applications. A systems biology approach requires the integration of a diversity of data collections, including, among others, genes, sRNAs, mRNAs, proteins, metabolites, protein-protein interactions, and protein-RNA interactions.
In the case of the Listeria genus, no such resource is currently available. At the species and strain levels, a first difficulty is indeed the recovery of these data sets in existing repositories, as is the case for Listeria proteomic data sets. In this way, the Listeriomics resource (http://listeriomics.pasteur.fr/) provides a unique, comprehensive, and up-to-date source of information on Listeria, including all the available complete genomes to date; a unified annotation of genes, proteins, and noncoding RNAs; and associated omics data and metadata relevant to genomic comparative and systems biology analyses. The curation process focused on the quality of the metadata provided. We completed the process by searching information in the corresponding publications. To our knowledge, the Listeriomics interface is the first bioinformatics resource designed to help scientists in Listeria pathogenesis research using a systems biology approach.
The information integrated into the Listeriomics database is provided through a usable and user-friendly interface that allows querying and visualization of Listeria genomic, phylogenetic, transcriptomic, and proteomic data sets, along with information on small RNAs and coexpression networks of genes and small RNAs. The user experience and feedback from our collaborators using the Listeriomics interface for the past 5 years were driving forces in organizing and improving the way to access data and tools (17,22,71). Indeed, through the Listeriomics resource, it is possible to summarize what has been discovered about a specific gene, including information on its distribution in the different Listeria genomes. Users can also identify the different transcriptomics data sets in which a specific gene or small RNA is differently expressed. Similarly, scientists can browse the different proteomics data sets to identify biological conditions in which a specific protein was detected by mass spectrometry. Finally, regulatory networks between genes and small RNAs can be inferred and explored.
The field of microbiology is considerably impacted by new technologies, and the role of databases in microbiology research will become even more important in the near future (72). Until now, most of the model organism resources have focused on the widely studied bacterium E. coli, often with an emphasis on the manual curation of some particular aspect of omics information. Only a few of them, like the RegulonDB resource dedicated to the E. coli transcriptional regulatory network, really address the challenge of integrating knowledge based on experimental high-throughput omics data sets (50). For B. subtilis, SubtiWiki originally focused on manually curated gene annotation, with a collection of interlinked pages on B. subtilis gene properties, including valuable information such as essentiality or sporulation, obtained from qualified members of the Bacillus community (73). Interestingly, the latest release of SubtiWiki integrates some omics data, with new modules allowing the linkage of pathway, interaction, and expression information (52). To our knowledge, none of these databases dedicated to model organisms such as E. coli or B. subtilis integrates as many data sets as the Listeriomics resource does. Moreover, Listeriomics is the only resource including such a variety of comparative and evolutionary analyses at the genomic level.

MATERIALS AND METHODS
Integration of Listeria species genomes and sRNAs. The complete Listeria genomes available in the NCBI RefSeq and GenBank databases (see Table S1) were downloaded and integrated into the Listeriomics database. Information about serotype, lineage, and the origin of strain isolation was included when possible. We used the PasteurMLST tool (74) to search for the sequence type and clonal complex of each strain. For L. monocytogenes EGD-e genes, we integrated their predicted functional group (11), COG, operon prediction (56), and subcellular protein localization prediction (27) (see Fig. S1).
Many studies have been performed to identify sRNAs in L. monocytogenes (22, 55-58). Among these studies, only one concerned strain 10403S (57); the others focused on EGD-e. Altogether, 304 noncoding RNA elements have now been reported in L. monocytogenes, of which 154 are sRNAs, 104 are asRNAs, and 46 are cis-regulatory elements (cisRegs), including riboswitches (see Table S2 and Fig. S1). We included these elements in our database, along with supplementary information gathered from all related publications. Secondary structures at 37°C were predicted for each noncoding RNA by using UNAFold software (hybrid-ss-min with default parameters on the whole RNA [76]).
Listeria ortholog gene families. Nucleic acid and amino acid sequences of all the annotated coding genes were produced for each complete Listeria genome by using GenBank files and a custom-made Python script. Amino acid sequences for each Listeria genome were aligned against those of other proteins of all the Listeria genomes by using BLASTP+ (53) with an E-value threshold of 0.01. We used PanOCT (78) to build families of Listeria orthologs. PanOCT is able to deal with recently diverged paralogs by using neighborhood gene information. The percentage of genomes needed for a cluster to be considered an orthologous gene family was set to 100%. The length ratio to eliminate shorter protein fragments when a protein is split because of a frameshift event was set to 1.33 as previously recommended (78). Ortholog gene families were finally extracted from panoct.pl output files by using the gene_order.pl script, which is included in the PanOCT archive.
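The effect of the 100% genome requirement can be illustrated with a small filter that keeps only the families represented in every genome. This is a sketch of the selection criterion, not of PanOCT itself; the function name, the dictionary layout (family name mapped to a genome-to-gene dictionary), and the toy identifiers are assumptions.

```python
def core_families(families, genomes, min_fraction=1.0):
    """Keep families with a member in at least min_fraction of the genomes.

    families: dict mapping family name -> dict of genome name -> gene identifier
    genomes:  collection of all genome names under consideration
    """
    genome_set = set(genomes)
    if not genome_set:
        raise ValueError("at least one genome is required")
    kept = {}
    for name, members in families.items():
        represented = genome_set.intersection(members)
        if len(represented) / len(genome_set) >= min_fraction:
            kept[name] = members
    return kept


# Toy input with three hypothetical genomes and two hypothetical families.
toy_genomes = ["strain_A", "strain_B", "strain_C"]
toy_families = {
    "family_1": {"strain_A": "a_0001", "strain_B": "b_0001", "strain_C": "c_0001"},
    "family_2": {"strain_A": "a_0002", "strain_C": "c_0002"},
}
print(sorted(core_families(toy_families, toy_genomes)))
```

With the threshold at 100%, only families with exactly this complete representation feed into the alignment and tree-building steps that follow.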
Amino acid sequences of each cluster were aligned by using ProbCons version 1.12 (79) with default parameters. Resulting alignments were postprocessed to filter unreliable positions by using Gblocks version 0.91b (80) with the parameter settings as follows. The minimum number of sequences for a conserved position was set to (n/2) + 1 (where n is the total number of sequences in the aligned data set), the maximum number of contiguous nonconserved positions was set to 50, the minimum length required for a block was set to 5, and gap positions were not allowed. The corresponding nucleic acid sequence alignments were obtained from each cluster amino acid sequence alignment with a custom-made Perl script.
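The last step, deriving a nucleic acid alignment from an amino acid alignment, is commonly done by threading each coding sequence onto its aligned protein, three nucleotides per residue and three gap characters per alignment gap. The sketch below shows that idea in Python; it is an assumed reconstruction of the general technique, not the custom Perl script mentioned above, and the function name is hypothetical.

```python
def thread_codons(aligned_protein, cds):
    """Map an ungapped coding sequence onto one row of a protein alignment.

    aligned_protein: protein sequence with '-' as the gap character
    cds:             coding sequence whose codons follow the residues in order;
                     trailing codons (e.g., a stop codon) are ignored
    """
    usable = len(cds) - len(cds) % 3  # drop an incomplete trailing codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if n_residues > len(codons):
        raise ValueError("coding sequence is too short for the protein")
    remaining = iter(codons)
    return "".join("---" if aa == "-" else next(remaining) for aa in aligned_protein)


# A gap in the protein row becomes a three-character gap in the codon row.
print(thread_codons("M-K", "ATGAAATAA"))
```

Working at the codon level keeps the reading frame intact, so columns retained by the protein-level filtering correspond to whole codons in the nucleic acid alignment.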
Listeria species reference phylogenomic tree. Nucleic acid sequence alignments of ortholog gene families were concatenated into a single superalignment. This superalignment was used to compute a maximum-likelihood tree by using FastTree 2 (81), a parallelized and optimized program for building maximum-likelihood trees. The following parameters were used: the generalized time-reversible model was chosen, the exhaustive search mode was selected to obtain a more accurate reconstruction, NNI and SPR heuristics were used to browse the tree space, and BIONJ weighting was chosen to join events during tree space browsing. Support analyses were performed with the Shimodaira-Hasegawa test associated with 1,000 resampling steps of site likelihood.
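The concatenation step that opens this procedure can be sketched as follows. The function name and the data layout (one dictionary of genome name to aligned sequence per ortholog family) are assumptions for illustration; the sketch relies on every family containing one sequence per genome, which follows from the 100% genome requirement stated above.

```python
def concatenate_alignments(alignments, genomes):
    """Join per-family alignments into one superalignment row per genome.

    alignments: list of dicts, each mapping genome name -> aligned sequence
    genomes:    genome names, in the order wanted for the output
    """
    parts = {genome: [] for genome in genomes}
    for index, alignment in enumerate(alignments):
        lengths = {len(alignment[genome]) for genome in genomes}
        if len(lengths) != 1:
            raise ValueError(f"alignment {index} has rows of unequal length")
        for genome in genomes:
            parts[genome].append(alignment[genome])
    return {genome: "".join(chunks) for genome, chunks in parts.items()}


# Toy input: two hypothetical families aligned across two hypothetical genomes.
toy_alignments = [
    {"strain_A": "ATG---", "strain_B": "ATGAAA"},
    {"strain_A": "CCC", "strain_B": "CCT"},
]
superalignment = concatenate_alignments(toy_alignments, ["strain_A", "strain_B"])
print(superalignment)
```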
Integration of transcriptomic data sets. We downloaded the 64 published Listeria-related ArrayExpress experiments (21) in the MAGE-TAB standard (82) (Fig. 1; see Table S3). Every MAGE-TAB file included an IDF (investigation description format) file showing general information about the experiment and an SDRF (sample and data relationship format) file describing the relationships between samples and data. We manually curated every SDRF file and integrated the different metadata of the data set by using the publication linked to each study when available. We grouped the metadata information into key biological parameters for each data set: growth, time point, temperature, mutant, media, strain used, and strain array (Fig. 2A to C).
We then added matrices of expression or comparison for each biological condition. In total, 32 different gene expression arrays and six RNA-Seq technologies were combined. For gene expression arrays, we downloaded the 32 ADF (array design format) files that describe the relationship between array probes and genes. Only processed data provided by ArrayExpress were used, and no new normalization was applied. For each gene, we calculated the median differences of log values for relative expression tables. In the case of RNA-Seq, alignments of raw reads were performed with the Bowtie (83), segemehl (84), or novoalignCS tool, depending on the sequencing technology. For each alignment file, we computed the number of reads per kilobase per million mapped reads for each genomic feature and the number of reads per million for genome-wide coverage. Differential expression analysis was performed with DESeq version 1.14.0 (85) on per-feature raw counts.
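The two RNA-Seq normalizations mentioned above follow standard definitions and can be written in a few lines. The function names and the toy numbers are assumptions for illustration; the sketch does not reproduce the alignment or DESeq steps.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


def reads_per_million(depth_at_position, total_mapped_reads):
    """Scale a per-position read depth to a library of one million reads."""
    if total_mapped_reads <= 0:
        raise ValueError("library size must be positive")
    return depth_at_position * 1e6 / total_mapped_reads


# Toy values: 500 reads on a 1,000-bp gene in a library of 2 million reads.
print(rpkm(500, 1000, 2_000_000))
print(reads_per_million(40, 2_000_000))
```

RPKM corrects for both feature length and sequencing depth, which makes per-feature values comparable within and between libraries, whereas the per-million scaling is enough for genome-wide coverage tracks displayed in a viewer.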
Altogether, 492 files (150 absolute-value data sets and 342 relative-expression data sets) were created and integrated into the Listeriomics website (Fig. 1 and 2; see Fig. S1). For every experiment, relative expression data were always available, whereas absolute expression data were found for only a quarter of the data.
Construction of an expression atlas. We designed an expression atlas to provide for each transcriptomic data set a list of the differently expressed genome elements based on log fold change values. A statistical analysis (Shapiro-Wilk and Lilliefors tests) showed that all the relative expression data sets have a Gaussian distribution with a mean equal to zero. We applied normalization directly to log fold change values by dividing every value by sigma, where sigma is the estimated standard deviation of each data set (equivalently, by multiplying every value by 1 divided by the square root of the estimated variance). This normalization ensured that every log fold change data set has a standard deviation equal to 1 (Fig. S2). Consequently, the user can apply a standard cutoff, like log fold change < 1.5, and extract the differently expressed genome elements for every data set. A standard deviation equal to 1 leads, on average, to a selection of 7.1% of the most differently expressed genome elements, without removing the existing differences in the number of elements differently expressed under different biological conditions.
Construction of coexpression networks. We selected 42 transcriptomic data sets for which the absolute expression of genes and small RNAs was available (21 tiling array and 21 RNA-Seq [see Table S4 and Fig. S1]). The Pearson correlation coefficient for each pair of genome elements across all biological conditions was calculated. A link in the coexpression networks was created only when the absolute Pearson correlation coefficient was >0.80.
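The network construction rule can be sketched as follows: compute the Pearson correlation coefficient for every pair of expression profiles and keep a link when its absolute value exceeds the cutoff. The function names, the toy profiles, and the decision to treat a constant profile as uncorrelated are assumptions made for this example, not details of the Listeriomics implementation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant profile creates no link
    return sxy / sqrt(sxx * syy)


def coexpression_edges(profiles, cutoff=0.80):
    """Return (element_a, element_b, r) for every pair with |r| above the cutoff."""
    names = sorted(profiles)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(profiles[a], profiles[b])
            if abs(r) > cutoff:
                edges.append((a, b, r))
    return edges


# Toy profiles over four hypothetical conditions.
toy_profiles = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 4.0, 6.0, 8.0],
    "gene_c": [4.0, 3.0, 2.0, 1.0],
    "gene_d": [1.0, 3.0, 2.0, 3.0],
}
for a, b, r in coexpression_edges(toy_profiles):
    print(a, b, round(r, 2))
```

Using the absolute value means that strongly anticorrelated elements are linked as well, which matches the rule stated above. The pairwise loop is quadratic in the number of elements, so a matrix-based computation would be preferable for a whole genome.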
Integration of proteomic data sets. By using the PubMed database, we selected 23 publications in which a mass spectrometry experiment with a Listeria strain was performed (see Table S5). We extracted all the available information on the biological conditions screened in each experiment from associated articles and supplementary files (Fig. 1). We also extracted the metadata information in the same format as for the transcriptomic data sets. In all the experiments, a list of the proteins detected was available. In total, we extracted 102 proteomics files (74 absolute expression data sets and 28 relative-expression data sets [Fig. 2D; see Fig. S1]).

Construction of the Listeriomics website and desktop versions. The Listeriomics website was built by using the BACNET development platform (unpublished data). This platform is based on Java and the Eclipse e4 RCP/RAP APIs for building both web and desktop software versions and provides a rich user interface. This open-source platform is generic and can be used to set up a similar website for any organism.
Tutorials for the Listeriomics website can be found on the Listeriomics MediaWiki webpage. These tutorials can be accessed either from the home webpage or directly at http://wiki.listeriomics.com/.