We provide a standardized protocol for the use of gene set enrichment analysis of transcriptomic data to identify an ideal mouse model for translational research.
This protocol can be used with DNA microarray and RNA sequencing data and can further be extended to other omics data if data are available.
Recent studies that compared transcriptomic datasets of human diseases with datasets from mouse models using traditional gene-to-gene comparison techniques reached contradictory conclusions regarding the relevance of animal models for translational research. A major reason for the discrepancies between different gene expression analyses is the arbitrary filtering of differentially expressed genes. Furthermore, the comparison of single genes between different species and platforms is often limited by technical variance, leading to misinterpretation of the concordance or discordance between data from human and animal models. Thus, standardized approaches for systematic data analysis are needed. We recently demonstrated that gene set enrichment analysis (GSEA) has the potential to overcome both subjective gene filtering and ineffective gene-to-gene comparisons. Therefore, we developed a standardized protocol for the use of GSEA to distinguish between appropriate and inappropriate animal models for translational research. This protocol cannot predict a priori how new model systems should be designed, as it requires existing experimental omics data. However, the protocol describes how to interpret existing data in a standardized manner in order to select the most suitable animal model, thus avoiding unnecessary animal experiments and misleading translational studies.
Animal models are widely used to study human diseases because of their assumed similarity to humans in terms of genetics, anatomy, and physiology. Moreover, animal models often serve as gatekeepers to clinical therapies and can have a huge impact on the success of translational research. Careful selection of the optimal animal model can reduce the number of misleading animal studies. Recently, the relevance of animal models for translational research has been the subject of controversy, particularly because analyses of the same datasets obtained from human inflammatory diseases and related mouse models led to contradictory conclusions 1,2. This discussion revealed a fundamental problem in the analysis of omics data: standardized approaches for systematic data analysis are needed in order to reduce biased gene selection and to increase the robustness of interspecies comparisons 3.
Traditionally, the analysis of transcriptomics data (and other omics data) is done at the single-gene level and includes an initial step of gene selection based on stringent cut-off parameters (e.g., fold change >2.0, p value <0.05). However, the choice of initial cut-off parameters is often subjective, arbitrary and not biologically justified, and can even lead to opposite conclusions1,2. Furthermore, initial gene selection generally restricts the analysis to a few strongly up- and downregulated genes and is thus not sensitive enough to capture the majority of genes that are differentially expressed to a lesser extent.
With the rise of the genomics era in the early 2000s and the increasing knowledge of biological pathways and contexts, alternative statistical approaches were developed that made it possible to circumvent the limitations of single-gene level analyses. Gene set enrichment analysis (GSEA)4, which is one of the most widely accepted methods for the analysis of transcriptomics data, makes use of a priori defined groups of genes (e.g., signaling pathways, proximal location on a chromosome, etc.). GSEA first maps all detected, unfiltered genes to the intended gene sets (e.g., pathways), irrespective of their individual change in expression. This approach thus also includes moderately regulated genes that would otherwise be lost in single-gene level analyses. The cumulative change in expression within each gene set is subsequently assessed using a running-sum statistic.
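The running-sum idea can be illustrated with a deliberately simplified sketch. It walks down a list of genes ranked by expression change, steps up whenever a gene belongs to the set and down otherwise, and reports the largest deviation from zero. The function name `enrichment_score` and the toy gene names are hypothetical, and the sketch omits the permutation testing and normalization that the GSEA software performs; it is not the software's implementation.

```python
def enrichment_score(ranked_genes, gene_set, weights=None):
    """Simplified running-sum enrichment score (illustrative only).

    ranked_genes: genes ordered from most up- to most downregulated.
    gene_set:     set of gene identifiers to test.
    weights:      optional ranking metric per gene; if None, every hit
                  counts equally (an unweighted, Kolmogorov-Smirnov-like sum).
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(in_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    if weights is None:
        hit_weights = [1.0 if hit else 0.0 for hit in in_set]
    else:
        hit_weights = [abs(w) if hit else 0.0 for w, hit in zip(weights, in_set)]
    total_hit_weight = sum(hit_weights)
    if total_hit_weight == 0:
        raise ValueError("all hit weights are zero")

    running = 0.0
    best = 0.0
    for hit, weight in zip(in_set, hit_weights):
        # Step up for a gene in the set, step down for a gene outside it.
        running += weight / total_hit_weight if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy example with hypothetical gene names.
ranking = ["A", "B", "C", "D", "E", "F"]
print(enrichment_score(ranking, {"A", "B"}))  # set clustered at the top
print(enrichment_score(ranking, {"E", "F"}))  # set clustered at the bottom
```

Because every gene in the ranking contributes a step, moderately regulated genes influence the score even though none of them would pass a fold-change filter on its own.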
Despite their wide use in medical research, GSEA and related set enrichment approaches are not yet routinely considered for the analysis of complex omics data. Here, we describe a protocol for comparing omics data from human samples with those from mouse models in order to identify the ideal model for translational studies. We demonstrate the applicability of the protocol using a collection of mouse models that are used to mimic human inflammatory disorders. However, this analysis pipeline is not restricted to human-mouse comparisons and is amenable to further research questions.
1. Download of the GSEA Software and the Molecular Signatures Database
2. Download Experimental Gene Expression Data for the Human Disorder and Appropriate Animal Models
3. Data Handling and Formatting
4. Performing the GSEA
5. Comparing the GSEA Results
6. Identifying the Optimal Animal Model
The GSEA workflow is illustrated with screenshots of exemplary data. Figure 1 shows the gene expression data file that contains the transcriptomic data of interest. Every study requires a descriptive phenotype file (Figure 2). Annotated gene sets (e.g., pathways) are defined in the gene set database file (Figure 3). Figure 4 shows a step-by-step protocol for the use of the GSEA software tool. An exemplary result report is given in Figure 5. Detailed GSEA enrichment results are summarized in Figure 6. For the comparison of different gene expression studies, in particular human vs. mouse studies, a contingency table is required (Figure 7). For the visualization of the results, Figure 8 shows a correlation matrix of pathway comparisons among human and mouse studies.
Figure 1: GSEA Gene Expression Data File. The file contains expression values for all detectable genes (or probes), including genes that might not be differentially expressed. The file therefore typically comprises many thousands of genes. (A) The gene expression data file includes data for each individual sample. The first line contains the name label (here: probe ID) followed by an optional description and the individual sample names (here: GSM515585, GSM515586, etc.). The remainder of the file contains expression values for each of the genes and for each sample in the dataset. (B) Alternative gene expression data format. Externally calculated group metrics (here: mean ratio) can be used for the GSEA preranked tool if individual sample data are not available.
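The layout described for panel A can be produced with a few lines of code. The following is a minimal sketch, assuming a tab-delimited text file whose header row holds a name column, a description column, and one column per sample; the function name `format_expression_table`, the `na` placeholder in the description column, and the probe identifiers are illustrative assumptions rather than part of the protocol.

```python
def format_expression_table(sample_names, rows):
    """Return tab-delimited text: NAME, DESCRIPTION, then one column per sample.

    sample_names: list of sample identifiers (e.g., GEO sample accessions).
    rows:         dict mapping a probe or gene identifier to its list of
                  expression values, one value per sample.
    """
    lines = ["\t".join(["NAME", "DESCRIPTION", *sample_names])]
    for identifier, values in rows.items():
        if len(values) != len(sample_names):
            raise ValueError(f"{identifier}: expected {len(sample_names)} values")
        # 'na' is used here as a placeholder for the optional description.
        lines.append("\t".join([identifier, "na", *(f"{v:g}" for v in values)]))
    return "\n".join(lines) + "\n"


# Hypothetical probes and values; the sample names are those shown in Figure 1.
table = format_expression_table(
    ["GSM515585", "GSM515586"],
    {"probe_1": [7.5, 8.25], "probe_2": [5.0, 4.75]},
)
print(table)
```

All detected probes should be written, not only those passing a differential expression filter, since the enrichment analysis relies on the complete ranking.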
Figure 2: GSEA Phenotype File. The file assigns individual samples to groups and labels the groups accordingly. The first line contains the total number of samples followed by the number of groups. The third field of the first line is always '1'. The second line begins with a pound sign (#) followed by a space and contains the name of each group. The third line contains a group label for each sample (here: 0 or 1).
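The three-line structure described in the caption can be generated programmatically. This is a small sketch under the assumption that fields are separated by single spaces; the function name `format_phenotype_file` and the group names are hypothetical.

```python
def format_phenotype_file(group_names, sample_labels):
    """Return the three-line phenotype text described in Figure 2.

    group_names:   list of group names, e.g. ["control", "treated"].
    sample_labels: one integer label per sample, in the same order as the
                   sample columns of the expression file; each label is an
                   index into group_names.
    """
    n_groups = len(group_names)
    if any(label not in range(n_groups) for label in sample_labels):
        raise ValueError("every sample label must index an existing group")

    first = f"{len(sample_labels)} {n_groups} 1"   # third field is always 1
    second = "# " + " ".join(group_names)          # pound sign, space, names
    third = " ".join(str(label) for label in sample_labels)
    return "\n".join([first, second, third]) + "\n"


# Hypothetical design: two control samples followed by two treated samples.
print(format_phenotype_file(["control", "treated"], [0, 0, 1, 1]))
```

The order of the labels in the third line has to match the order of the sample columns in the expression data file, so generating both files from the same sample list is a simple way to keep them consistent.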
Figure 3: GSEA Gene Set Database File. The file defines sets of genes that are assigned to certain biological processes or categories (here: inflammatory pathways). In the GMT format, each row represents a gene set, which is defined by a name, a description, and the included genes (official HUGO gene symbols).
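Reading such a file back into memory is straightforward, which is useful for checking gene set sizes before an analysis. The sketch below assumes tab-separated fields with the name first, the description second, and gene symbols in the remaining fields; the function name `parse_gmt` and the example gene sets are hypothetical.

```python
def parse_gmt(text):
    """Parse GMT-formatted text into {name: {"description": ..., "genes": [...]}}."""
    gene_sets = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"line {line_number}: expected name, description, genes")
        name, description, *genes = fields
        gene_sets[name] = {
            "description": description,
            "genes": [gene for gene in genes if gene],  # drop empty trailing fields
        }
    return gene_sets


# Hypothetical gene sets using official gene symbols.
example = "PATHWAY_A\tdemo set\tTNF\tIL6\tIL1B\nPATHWAY_B\tna\tNFKB1\tRELA\n"
for name, entry in parse_gmt(example).items():
    print(name, len(entry["genes"]), entry["genes"])
```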
Figure 4: GSEA Software Settings. The GSEA software tool was downloaded from the Broad Institute website as a Java desktop application. (A) Start menu. The left side contains the navigation menu, while the right section (Home) gives a short summary of the GSEA workflow. Clicking the Load data button opens a new tab for importing the files. (B) Load data section before data import. Required files can be imported via the file browser. (C) Load data section after data import. Imported data files are listed in the Object cache and are organized into datasets (mandatory file), phenotypes (mandatory file), gene set databases (optional if an internet connection is available) and chip files (optional if an internet connection is available). Clicking the Run GSEA button opens a new tab for setting the analysis parameters. (D) Run GSEA section. The tab for setting the analysis parameters is divided into required fields, basic fields and advanced fields. Clicking the Run button at the bottom right of the window starts the analysis. The progress of the analysis is then visible in the GSEA reports section at the bottom left of the window. Once the analysis has finished, the status 'success' appears in the GSEA reports section. (E) GSEA preranked tool. Gene expression data files containing externally calculated group metrics instead of individual sample data can be analyzed via the main navigation bar.
Figure 5: GSEA Report. The GSEA report opens in a browser window and summarizes all results and selected parameters. The upper two sections of the navigation menu comprise gene set enrichment results for the defined groups (e.g., enrichment in S. aureus treated samples or healthy control samples). In this example, 42 of 65 gene sets (pathways) are activated in S. aureus treated mice, of which 14 are significantly enriched with an FDR below 25%. Similarly, 23 of 65 gene sets (pathways) are inhibited in S. aureus treated mice, of which 18 are significantly enriched with an FDR below 25%. Clicking on the detailed enrichment results opens an HTML or Excel file for exporting the analysis data required for a comparison of different gene expression studies.
Figure 6: Detailed Enrichment Results. (A) Exported spreadsheet file containing detailed analysis results for gene sets (pathways) that were activated in S. aureus treated mice. The spreadsheet file contains extensive data for each analyzed gene set, including the name of the gene set, its size, its normalized enrichment score, its nominal (uncorrected) p value and its FDR value. (B) Simplified spreadsheet file that contains only the information required for comparing different gene expression studies.
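Once the simplified results are available, each gene set can be assigned a regulation call for later comparison between studies. The sketch below assumes rows of (gene set name, normalized enrichment score, FDR) and uses the 25% FDR threshold mentioned for the GSEA report; the three-way labels "up", "down" and "unchanged", the function name `classify_gene_sets`, and the example values are illustrative assumptions.

```python
FDR_CUTOFF = 0.25  # 25% FDR threshold, as used in the GSEA report


def classify_gene_sets(rows, fdr_cutoff=FDR_CUTOFF):
    """Map each gene set to 'up', 'down' or 'unchanged'.

    rows: iterable of (name, normalized_enrichment_score, fdr) tuples.
    A gene set counts as regulated only if its FDR is below the cutoff;
    the sign of the enrichment score then gives the direction.
    """
    calls = {}
    for name, nes, fdr in rows:
        if fdr < fdr_cutoff and nes > 0:
            calls[name] = "up"
        elif fdr < fdr_cutoff and nes < 0:
            calls[name] = "down"
        else:
            calls[name] = "unchanged"
    return calls


# Hypothetical gene sets and scores, not taken from any dataset.
results = [
    ("PATHWAY_A", 2.1, 0.01),
    ("PATHWAY_B", -1.8, 0.10),
    ("PATHWAY_C", 1.2, 0.40),
]
print(classify_gene_sets(results))
```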
Figure 7: 3 x 3 Contingency Table of GSEA Results. (A) Common contingency table format for the comparison of 2 studies. (B) Exemplary numbers of regulated pathways for the comparison of a human sepsis study (GSE9960) with a murine S. aureus injection model (GSE20524).
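A 3 x 3 table of this kind can be tallied from per-pathway regulation calls of two studies. The following sketch assumes each study is represented as a dictionary mapping a pathway name to one of three labels; the labels, the function name `contingency_table`, and the example pathways are hypothetical, and only pathways present in both studies are counted.

```python
CATEGORIES = ("up", "unchanged", "down")


def contingency_table(calls_a, calls_b):
    """Count pathways by their regulation call in study A (rows) and study B (columns)."""
    table = {row: {col: 0 for col in CATEGORIES} for row in CATEGORIES}
    for pathway in sorted(set(calls_a) & set(calls_b)):
        table[calls_a[pathway]][calls_b[pathway]] += 1
    return table


# Hypothetical calls for a human study and a mouse study.
human = {"P1": "up", "P2": "up", "P3": "down", "P4": "unchanged"}
mouse = {"P1": "up", "P2": "unchanged", "P3": "down", "P4": "unchanged", "P5": "up"}

table = contingency_table(human, mouse)
for row in CATEGORIES:
    print(row, [table[row][col] for col in CATEGORIES])
```

Counts on the diagonal represent pathways regulated in the same direction in both studies, while off-diagonal counts indicate disagreement.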
Figure 8: Correlation Matrix of Pathway Comparisons Between Human and Mouse Studies. The overlap of pathway regulation is shown as the gain of information that can be obtained from one (mouse) study for predicting the effects in another (human) study (blue, decrease, low correlation; red, increase, high correlation). In this example, the comparison of human with murine datasets revealed a subgroup of experimental murine models that were highly correlated with human clinical studies (studies 10 and 11, dotted line), indicating that these mouse models are best suited for mimicking the human situation. In contrast, studies 7, 8 and 9 showed no correlation with the human disease studies.
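The caption does not state how the gain of information is calculated. One plausible reading, shown here purely as an assumption, is the reduction in entropy of the human regulation calls once the mouse calls are known (mutual information, in bits), computed from a 3 x 3 table of pathway counts. The function names and the example tables are hypothetical.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(table):
    """Entropy of the column variable minus its entropy given the row variable.

    table: list of rows (mouse call) by columns (human call) of pathway counts.
    This is an assumed measure, not necessarily the one used for Figure 8.
    """
    total = sum(sum(row) for row in table)
    if total == 0:
        raise ValueError("table contains no pathways")
    column_totals = [sum(column) for column in zip(*table)]
    conditional = sum(
        (sum(row) / total) * entropy(row) for row in table if sum(row) > 0
    )
    return entropy(column_totals) - conditional


# Hypothetical tables: perfect agreement versus no relationship.
concordant = [[5, 0, 0], [0, 5, 0], [0, 0, 5]]
unrelated = [[2, 2, 2], [2, 2, 2], [2, 2, 2]]
print(information_gain(concordant))
print(information_gain(unrelated))
```

Computing such a value for every pair of studies yields a square matrix that can be displayed as a heat map, with high values marking mouse models whose pathway regulation is informative about the human condition.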
Animal models have long been used to investigate disease mechanisms and to develop novel therapeutic strategies. However, skepticism regarding the predictive value of animal models started to spread following failures of clinical trials12. Furthermore, controversy about appropriate strategies for analyzing and interpreting big omics data from preclinical trials arose when opposite conclusions were drawn from the same data after applying differing data analysis strategies1,2. Consequently, there is a high demand for robust bioinformatics techniques for the analysis of complex omics data to systematically define the optimal animal model for a given human disease. Applying the best available model not only improves translational research but also contributes to animal welfare by avoiding animal experiments that might not correlate with the human situation.
The presented protocol describes a standardized approach for systematically comparing omics data from different species, with the aim of identifying the optimal animal models and treatment protocols for a given human disorder. By using GSEA instead of single-gene analysis, this protocol circumvents the problems associated with the subjective setting of gene expression thresholds and gene filtering. The focus on selected pathways further makes it possible to specifically address the (patho)physiological process of the disorder or condition in question (e.g., inflammation). Of course, the accuracy of the GSEA results depends on the quality of current gene set annotations and on whether regulatory mechanisms are conserved between species. However, we hypothesize that conservation is generally higher at the pathway level than at the single-gene level. In addition, set enrichment approaches are more robust than single-gene analyses for comparisons of transcriptomic data between different platforms and experimental models or clinical cohorts13.
Instead of using pre-defined gene sets such as pathways, the presented approach also allows custom gene sets to be defined. In particular, experimental expression data can be used to identify relevant genes that are activated or inhibited in one condition (e.g., the overlap of regulated human genes in clinical cohorts). The de novo defined gene sets can then be tested for enrichment in data from different animal models. This alternative approach avoids the 'detour' of using annotated pathways. Further, the protocol is not restricted to the comparison of transcriptomic data, but is transferable to any omics data, including proteomics and metabolomics. Nonetheless, one has to keep in mind that this approach is limited to existing omics data from mouse models and humans, and that it does not indicate how to develop new animal models. However, it represents an effective approach for the standardized interpretation of existing data, which may facilitate the careful selection of the optimal animal model and thus avoid unnecessary and misleading translational studies.
The authors have nothing to disclose.
This work was financed by the German Federal Institute for Risk Assessment (BfR).