GSEA-P: a desktop application for Gene Set Enrichment Analysis

Subramanian, Aravind; Kuehn, Heidi; Gould, Joshua; Tamayo, Pablo; Mesirov, Jill P.

doi:10.1093/bioinformatics/btm369

Abstract

Gene Set Enrichment Analysis (GSEA) is a computational method that assesses whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states. We report the availability of a new version of the Java-based software (GSEA-P 2.0) that represents a major improvement on the previous release through the addition of a leading edge analysis component, seamless integration with the Molecular Signature Database (MSigDB) and an embedded browser that allows users to search for gene sets and map them to a variety of microarray platform formats. This functionality makes it possible for users to directly import gene sets from MSigDB for analysis with GSEA. We have also improved the visualizations in GSEA-P 2.0 and added links to a new form of concise gene set annotations called Gene Set Cards. These additions, as well as other improvements suggested by the more than 3500 users who have downloaded the software over the past year, have been incorporated into this new release of the GSEA-P Java desktop program.

Availability: GSEA-P 2.0 is freely available for academic and commercial users and can be downloaded from http://www.broad.mit.edu/GSEA

Contact: mesirov@broad.mit.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

The selection of differentially expressed genes helps associate biological phenotypes with their underlying molecular mechanisms, thereby providing insights into biological function. However, analyzing and interpreting a given list of genes can be challenging because it is difficult to evaluate objectively the members of a given pathway or functional class represented in a gene list. Additionally, single gene-marker-based approaches can fail to detect transcriptional programs that are distributed across an entire network of genes yet are subtle at the level of individual genes.

To address this problem we previously introduced a statistical methodology called Gene Set Enrichment Analysis (GSEA) for determining whether a given gene set is significantly enriched in a list of gene markers ranked by their correlation with a phenotype of interest (Mootha et al., 2003; Subramanian et al., 2005). The method has been successfully used to discover metabolic pathways altered in human diabetes (Mootha et al., 2003), compare expression profiles of mouse and human (Sweet-Cordero et al., 2005), reveal more consistency between independent lung cancer outcome datasets at the gene set level than at the single gene level (Subramanian et al., 2005) and characterize molecular phenotypes in acute megakaryoblastic leukemia (Bourquin et al., 2006), among many other applications.

Given a list of genes, ranked by the correlation of their genome-wide expression profiles with one of two phenotypes, GSEA seeks to estimate the significance of the over-representation of an independently defined set of genes, S, in the highly correlated or anti-correlated genes in the list. To evaluate this degree of ‘enrichment’ the GSEA method calculates an Enrichment Score (ES) by walking down the list, increasing a cumulative sum when a gene is in S and decreasing it when a gene is not in S. The size of the increment depends on the gene-phenotype correlation. The ES is the maximum deviation from zero of the cumulative sum and can be interpreted as a weighted Kolmogorov-Smirnov statistic. The genes in the gene set S that appear in the ranked list before the point where the running sum achieves the ES are called the leading-edge subset and are particularly important in interpreting GSEA results. The significance of a gene set's ES is estimated by an empirical phenotype-based permutation test procedure. When an entire database of gene sets is scored, the resulting P-values must be adjusted to account for multiple hypothesis testing. GSEA normalizes the ES for each gene set to account for the variation in set sizes, yielding a normalized enrichment score (NES), and calculates a false discovery rate (FDR) corresponding to each NES. The FDR gives an estimate of the probability that a set with a given NES represents a false positive finding; it is computed by comparing the tails of the observed and permutation-computed null distributions for the NES.
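The running-sum calculation described above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the GSEA-P implementation: the function name `enrichment_score`, the exponent parameter `p`, the normalization of hit increments by the total hit weight and the constant miss decrement are choices made here so that the running sum returns to zero at the end of the list. The text defines the leading-edge subset for a peak reached from the top of the list; treating a negative ES symmetrically (members of S at or after the trough) is also an assumption.

```python
def enrichment_score(ranked_genes, correlations, gene_set, p=1.0):
    """Running-sum enrichment score over a ranked gene list (illustrative).

    ranked_genes: gene identifiers, most positively correlated first.
    correlations: gene-phenotype correlation for each ranked gene.
    gene_set:     the independently defined set S.
    p:            exponent applied to |correlation| (assumed parameter;
                  p = 0 gives an unweighted Kolmogorov-Smirnov-like walk).
    Returns (es, leading_edge, running_sum).
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_hits = sum(in_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("S must overlap the ranked list without covering it")

    # Increment for a hit is proportional to |correlation|^p.
    weights = [abs(r) ** p if hit else 0.0
               for r, hit in zip(correlations, in_set)]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("all genes in S have zero weight")

    running, total = [], 0.0
    for w, hit in zip(weights, in_set):
        # Step up on a gene in S, step down by a constant otherwise.
        total += w / total_weight if hit else -1.0 / n_miss
        running.append(total)

    # ES is the maximum deviation from zero of the running sum.
    peak = max(range(len(running)), key=lambda i: abs(running[i]))
    es = running[peak]

    if es >= 0:
        # Members of S up to and including the peak position.
        leading = [g for g, hit in zip(ranked_genes[:peak + 1],
                                       in_set[:peak + 1]) if hit]
    else:
        # Assumption: for a negative ES, take members of S from the trough on.
        leading = [g for g, hit in zip(ranked_genes[peak:],
                                       in_set[peak:]) if hit]
    return es, leading, running
```

Calling `enrichment_score(genes, corrs, {"g1", "g2"})` on a six-gene list in which `g1` and `g2` are ranked first returns a positive ES with both genes in the leading-edge subset. Significance, the NES and the FDR require permutation-based null distributions and are not covered by this sketch.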

The power of GSEA depends on how well the gene sets used to assess enrichment represent meaningful coordinated gene expression behavior that reflects actual biological processes. The more accurately gene sets represent specific transcriptional processes relevant for a particular cellular state the better they will perform as GSEA queries. For this reason, the definition and curation of gene sets is of paramount importance. We have begun a process of systematically collecting gene sets into a Molecular Signature Database (MSigDB).

The MSigDB contains over 3000 gene sets of different types: (i) sets representing genes in the same chromosome or cytogenetic band, (ii) gene sets representing metabolic and signaling pathways from eight publicly available, manually curated pathway databases, (iii) genes reported in the literature as coexpressed in response to genetic or chemical perturbations, (iv) genes sharing conserved upstream regulatory motifs and (v) sets of genes in expression neighborhoods of cancer-related genes. Users may use this resource or define their own gene sets relevant to the process or phenotype they are investigating.

Version 1.0 of the GSEA-P software and MSigDB were originally released in spring 2005. There are currently over 3500 registered users. Version 2.0 of both the software and the database substantially enhances the features, interface and content, as we describe below.

2 FEATURES

Version 2.0 of the GSEA-P Java desktop software contains a complete implementation of the GSEA methodology, including leading edge analysis, as well as several usability improvements based on user feedback. New features include a gene set browser to search, download and map gene sets from the MSigDB database. We have also developed a website with comprehensive software documentation and Gene Set Cards with annotations including the source and biological relevance of MSigDB gene sets. A complete list of the new and improved features is given in Supplementary Table 1.

2.1 Enrichment analysis

In enrichment analysis, a user seeks to determine whether the members of a gene set are over-represented at the top (or bottom) of a ranked list of markers that have been ordered by their correlation with a specified phenotype. This functionality is central to the GSEA-P 2.0 software and is accessed via the ‘Run GSEA’ page. Users select a dataset, a phenotype and a gene set collection, and set parameters to run an enrichment analysis. We have improved this interface by enabling conversion of the dataset and gene sets to the same identifier format (i.e. gene symbols) before running the analysis (see ‘Chip2Chip’ description below and Supplementary Fig. 1). To address the need for alternative or specialized gene ranking procedures, GSEA-P now also accepts a user-provided ranked gene list. Importantly, in this new release, enrichment results are saved to an XML-formatted local database and hence are available for downstream analysis with other GSEA components (see ‘Leading edge analysis’ below) and integration with other software programs.
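As an illustration of how a ranked list of markers might be prepared outside the program, the following Python sketch orders genes by the correlation of their expression profiles with a two-class phenotype. It is a sketch under assumptions: the Pearson correlation against 0/1 class labels is used here only as one possible ranking metric, and the names `pearson` and `rank_genes` and the dictionary-based input layout are hypothetical and do not correspond to GSEA-P file formats or options.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A constant profile carries no information about the phenotype.
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank_genes(expression, phenotype):
    """Rank genes by correlation with a phenotype, highest first.

    expression: dict mapping gene -> list of expression values per sample.
    phenotype:  list of 0/1 class labels, one per sample (assumed encoding).
    Returns a list of (gene, correlation) pairs.
    """
    scored = [(gene, pearson(values, phenotype))
              for gene, values in expression.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored
```

Genes at the top of the returned list are the most positively correlated with the class labelled 1, and genes at the bottom are the most anti-correlated.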

2.2 Enrichment reports

GSEA-P 2.0 produces richly annotated HTML reports of enrichment results. In addition to statistical details such as the ES, P-value and FDR, we now provide a link to gene set annotations at the MSigDB website. These annotations allow users to view the full details of the provenance and content of a gene set in a structure similar to that of the GeneCards resource (Rebhan et al., 1997). The GSEA report also contains improved enrichment plots (Supplementary Fig. 2).

2.3 Leading edge analysis

After an enrichment analysis has been performed, it is often useful to examine and compare the genes in high-scoring sets that occur before the maximum of the running ES. These genes can be thought of as the core of a gene set that drives the enrichment signal. By grouping leading edge subsets, high-scoring gene sets can often be categorized into similar and distinct biological processes.

To facilitate leading edge analysis, GSEA-P 2.0 provides an interactive viewer that can be run after a GSEA process completes. The user selects gene sets for leading edge analysis, after which the program: (1) computes the core matrix over all selected gene sets, (2) clusters this matrix and (3) visualizes the result in a heat map (Supplementary Fig. 3A). Additionally, similarities between gene sets can be visualized by the Jaccard coefficient (Supplementary Fig. 3B).
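Steps (1) and the Jaccard comparison can be sketched in Python as follows. The sketch assumes that the core matrix is a binary membership matrix (gene sets by leading-edge genes); the text does not define the matrix entries, so this encoding, together with the function names `core_matrix`, `jaccard` and `pairwise_jaccard`, is an assumption. The clustering and heat map steps are omitted.

```python
from itertools import combinations


def core_matrix(leading_edges):
    """Binary membership matrix over the selected gene sets.

    leading_edges: dict mapping gene set name -> iterable of its
                   leading-edge genes.
    Returns (set_names, genes, matrix), where matrix[i][j] is 1 if
    genes[j] belongs to the leading edge of set_names[i], else 0.
    """
    members = {name: set(genes) for name, genes in leading_edges.items()}
    set_names = sorted(members)
    genes = sorted(set().union(*members.values())) if members else []
    matrix = [[1 if gene in members[name] else 0 for gene in genes]
              for name in set_names]
    return set_names, genes, matrix


def jaccard(a, b):
    """Jaccard coefficient: |intersection| / |union| of two gene collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(leading_edges):
    """Jaccard coefficient for every unordered pair of gene sets."""
    return {(x, y): jaccard(leading_edges[x], leading_edges[y])
            for x, y in combinations(sorted(leading_edges), 2)}
```

A coefficient near 1 indicates that two gene sets share most of their leading-edge genes and may reflect the same biological process; a coefficient of 0 indicates disjoint leading edges.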

2.4 Batch analysis mode

To support the analysis of a large number of datasets or the integration of GSEA into a data analysis pipeline, GSEA-P 2.0 can run in ‘headless’ mode as part of a shell script or load sharing facility. The analysis performed and the reports produced in this mode are identical to those produced with the graphical user interface.

2.5 Mapping identifiers between platforms with Chip2Chip

Microarray platforms come from a number of manufacturers who use a variety of identifiers to represent gene transcripts. Additionally, cross-species comparisons require ortholog mappings. Several tools such as NetAffx (Liu et al., 2003) provide the ability to map a given list of genes between platforms. However, these programs are often restricted to a particular vendor or are cumbersome to use when mapping a large collection of gene sets as they are tailored to map a single input list. To address this need, the GSEA-P 2.0 software provides a new utility called Chip2Chip that maps identifiers between platforms. Currently, GSEA-P 2.0 supports mappings between 93 platforms. Chip2Chip can convert between Entrez gene symbols and any of these platforms or between identifiers for any two of these chip types (Supplementary Fig. 4).
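The kind of mapping involved can be sketched in Python as follows. This is an illustration only: the text does not describe how Chip2Chip performs the conversion internally, so using gene symbols as the pivot between two platforms is an assumption, and the function names and the identifier-to-symbol dictionaries are hypothetical stand-ins for platform annotation tables.

```python
def map_to_symbols(gene_set, id_to_symbol):
    """Convert platform identifiers in a gene set to gene symbols.

    gene_set:     iterable of platform-specific identifiers.
    id_to_symbol: dict mapping identifier -> gene symbol (assumed input).
    Returns (symbols, unmapped), both as sets.
    """
    symbols, unmapped = set(), set()
    for identifier in gene_set:
        symbol = id_to_symbol.get(identifier)
        if symbol:
            symbols.add(symbol)
        else:
            # Keep track of identifiers with no annotation.
            unmapped.add(identifier)
    return symbols, unmapped


def map_between_platforms(gene_set, source_to_symbol, target_to_symbol):
    """Map a gene set from a source platform to a target platform.

    Assumption: gene symbols act as the common key between platforms.
    Several target identifiers may share a symbol; all are returned.
    """
    symbol_to_target = {}
    for identifier, symbol in target_to_symbol.items():
        symbol_to_target.setdefault(symbol, set()).add(identifier)

    symbols, _ = map_to_symbols(gene_set, source_to_symbol)
    mapped = set()
    for symbol in symbols:
        mapped |= symbol_to_target.get(symbol, set())
    return mapped
```

Because the same mapping functions are applied to every gene set in a collection, a whole database of gene sets can be converted in a single loop rather than one input list at a time.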

2.6 Integrated gene set browser & query interface

To give users of GSEA-P 2.0 easy access to the substantially enlarged MSigDB 2.0 collection, we have embedded a gene set browser in the software. The browser lets users quickly search MSigDB for gene sets through an intuitive graphical user interface (Supplementary Fig. 5). By providing an integrated program, we enable the seamless interoperation of gene set analytics with the MSigDB gene sets database.

2.7 Documentation

The website accompanying GSEA-P 2.0 includes extensive documentation: a user guide describing all aspects of the software, an illustrated tutorial, a frequently asked questions section, as well as four examples of GSEA analysis and results. The documentation is packaged into a GSEA Wiki site which will grow over time.

ACKNOWLEDGEMENTS

The authors wish to thank members of the Cancer Program at the Broad Institute for suggestions. They also thank Jide software for a free license to their component suite.

Conflict of Interest: none declared.

REFERENCES

Bourquin JP, et al. Identification of distinct molecular phenotypes in acute megakaryoblastic leukemia by gene expression profiling, Proc. Natl Acad. Sci. USA, 2006, vol. 103 (pg. 3339-3344)

Liu G, et al. NetAffx: Affymetrix probesets and annotations, Nucleic Acids Res, 2003, vol. 31 (pg. 82-86)

Mootha VK, et al. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes, Nat. Genet, 2003, vol. 34 (pg. 267-273)

Rebhan M, et al. GeneCards: Encyclopedia for Genes, Proteins and Diseases, Weizmann Institute of Science, Bioinformatics Unit and Genome Center, Trends in Genetics, 1997, vol. 13 (pg. 163)

Subramanian A, et al. From the Cover: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles, Proc. Natl Acad. Sci. USA, 2005, vol. 102 (pg. 15545-15550)

Sweet-Cordero A, et al. An oncogenic KRAS2 expression signature identified by cross-species gene-expression analysis, Nat. Genet, 2005, vol. 37 (pg. 48-55)

Author notes

Associate Editor: Olga Troyanskaya
