GSEA & annData#

Often it is really informative and useful to be able to run gene enrichment analysis on an annData object to see what pathways or signatures are enriched in certain clusters compared to others. There are packages that can be coerced to do this but they are not always straightforward to use. The purpose of this notebook is to show you how to run GSEA on your scRNAseq data and visualize the enichment results. We have made these tasks extremely easy through the use of our helper functions that seamlessly interface to GSEAPY for the analysis, and through the use of our visualization functions.

Data#

The training data is a subset of the mouse gastrulation atlas described in Pijuan-Sala et al Nature 2019. We have reduced the size of this dataset to make running this tutorial more accessible. We have also re-annotated the cell lineages based on recent spatial transcriptomics studies of mouse gastrulation (see Lohoff et al 2021 and Kumar et al 2023).

  • You can download the h5ad file from here

  • There are 1,875 cells

  • 29,452 genes

  • 25 cell types or lineages

  • Sampled from embryonic stages E6.5 to E8.5

Gene signatures#

In this notebook, we will onyl use the mouse MSigDB Hallmarks as described here but any appropriate .gmt file can be used. Below are some additional gene signature sets that we have cobbled together from various sources and that are useful in developmental and stem cell systems:

Steps to GSEA#

Load packages#

import scanpy as sc
from joblib import dump, load
import numpy as np
import pandas as pd
import pySingleCellNet as pySCN
import matplotlib.pyplot as plt
import seaborn as sns
import anndata

import warnings
warnings.filterwarnings('ignore')
sc.settings.verbosity = 0

Load gene signatures#

hallmarks = pySCN.read_gmt("mh.all.v2023.2.Mm.symbols.gmt")

Load data#

Then subject it to the usual processing pipeline. Low quality cell barcodes have alread been removed.

adata = sc.read_h5ad("mouseGastrulation_n75_032124.h5ad")
adata = pySCN.mito_rib(adata, species='MM', clean=True)
min_num_cells = 37
sc.pp.filter_genes(adata, min_cells=min_num_cells)

Norm/HVG/PCA#

adNorm = pySCN.norm_hvg_scale_pca(adata)
sc.pl.pca_variance_ratio(adNorm, 50)
../_images/2525a2ad6b37cba911e45aea6a8a259f53522b27e3fb5abb1800fffb243c1991.png

kNN/Cluster/UMAP#

n_pcs = 30
n_neighbors = 30

sc.pp.neighbors(adNorm, n_neighbors=n_neighbors, n_pcs=n_pcs)

sc.tl.leiden(adNorm,.1)
sc.tl.paga(adNorm)
sc.pl.paga(adNorm, plot=False)
sc.tl.umap(adNorm, init_pos='paga')
sc.pl.umap(adNorm,color=['celltype'], alpha=.75, s=15)
../_images/b9917b4c468a874038eb464f4ed1957ced9855768a099a22de49ed6e948b80b0.png
sc.pl.umap(adNorm,color=['leiden'], alpha=.75, s=15, legend_loc='on data')
../_images/973eed282bbcb476414f5488991ef3ec418a419e599dc3f5d13f193e70be6d3d.png

Annotate these ‘coarse’ clusters#

The low Leiden resolution resulted in coarse clustering roughly corresponding:

  • Erythrocyte (7)

  • Endothelial, hemogenic endothelium (4)

  • Mesoderm (0)

  • Ectoderm, Epiblast, Primordial germ cell (1)

  • APS, Node and Notochord (2)

  • Endoderm (3)

  • Extra-embryonic ectoderm (5)

  • Parietal endoderm (6).

cell_dict = {‘Erythrocyte’: [‘7’], ‘HE, endothelial’: [‘4’], ‘Mesoderm’: [‘0’], ‘Ectoderm, Epiblast, PGC’: [‘1’], ‘APS, Node, Notochord’: [‘2’], ‘Endoderm ‘: [‘3’], ‘ExE Ectoderm’: [‘5’], ‘Parietal Endoderm’: [‘6’] }

new_obs_name = ‘coarse_cluster’ adNorm.obs[new_obs_name] = np.nan

for i in cell_dict.keys(): ind = pd.Series(adNorm.obs.leiden).isin(cell_dict[i]) adNorm.obs.loc[ind,new_obs_name] = i

adNorm.obs[new_obs_name] = adNorm.obs[new_obs_name].astype(“category”)

sc.pl.umap(adNorm,color=[new_obs_name], alpha=.75, s=15, legend_loc='on data', legend_fontsize=8)
../_images/5ccaddfeb9369a8ecfec389a5bfcaee0313ffa38c55f6b56873d7c74ef69ec01.png

Analysis of coarse clusters#

In the next few steps, we will run differentiation gene expression analysis and then GSEA to identify genes and pathways enriched in these broadly annotated cell groups.

Differential gene expression (DEG) analysis - coarse#

sc.tl.rank_genes_groups(adNorm, use_raw=False, groupby="coarse_cluster")
sc.tl.filter_rank_genes_groups(adNorm, min_fold_change=.5, min_in_group_fraction=.3, max_out_group_fraction=.1)

fig, ax = plt.subplots(figsize=(5, 5))
sc.pl.rank_genes_groups_dotplot(adNorm, n_genes=3, groupby="coarse_cluster", dendrogram=True, key='rank_genes_groups_filtered', swap_axes=True, ax=ax)
../_images/cc3d696d85005c9e4e0a2aed0faaf2d2e8a875c7ed703261cd34900f23bbd72f.png

Note to self: re-work the figure above so that cell group labels on x-axis are rotated 45, and remove the gene-group labes on the y-axis. Probably can do so by extracting vars, then passing to sc.pl.DotPot.

GSEA#

deg_res = pySCN.convert_diffExp_to_dict(adNorm)
gsea_res = pySCN.gsea_on_deg(deg_res, "hallmarks",genesets = hallmarks, permutation_num = 1e3)
2024-03-21 17:44:34,340 [WARNING] Duplicated values found in preranked stats: 0.02% of genes
The order of those genes will be arbitrary, which may produce unexpected results.
2024-03-21 17:44:37,130 [WARNING] Duplicated values found in preranked stats: 0.02% of genes
The order of those genes will be arbitrary, which may produce unexpected results.

Heatmap GSEA results#

gsea_matrix_coarse = pySCN.collect_gsea_results_from_dict(gsea_res, .05)
pySCN.heatmap_gsea(gsea_matrix_coarse, figsize=(7,12),clean_signatures = True, clean_cells=True)
../_images/46d91a372dbb05ad1991d23dbb63f92c5b06d6a70d8f0ab41d0aef037de8ba9c.png

Analysis of cell types#

Here, we will run DEG and GSEA based on the more finely resolved cell annotations that were provided with the data.

DEG#

sc.tl.rank_genes_groups(adNorm, use_raw=False, groupby="celltype")
sc.tl.filter_rank_genes_groups(adNorm, min_fold_change=.5, min_in_group_fraction=.5, max_out_group_fraction=.15)
sc.tl.dendrogram(adNorm, "celltype")
sc.pl.rank_genes_groups_dotplot(adNorm, n_genes=1, groupby="celltype", dendrogram=True, key='rank_genes_groups_filtered', swap_axes=True)
../_images/338232072fcad454b44830ef15f510215092e64637e21e15f85ff3d44aa01a66.png

GSEA#

deg_res_ct = pySCN.convert_diffExp_to_dict(adNorm)
gsea_res_ct = pySCN.gsea_on_deg(deg_res_ct, "hallmarks",genesets = hallmarks, permutation_num = 1e3)
2024-03-21 17:44:51,532 [WARNING] Duplicated values found in preranked stats: 0.01% of genes
The order of those genes will be arbitrary, which may produce unexpected results.
2024-03-21 17:44:52,418 [WARNING] Duplicated values found in preranked stats: 0.01% of genes
The order of those genes will be arbitrary, which may produce unexpected results.
2024-03-21 17:44:53,323 [WARNING] Duplicated values found in preranked stats: 0.01% of genes
The order of those genes will be arbitrary, which may produce unexpected results.
2024-03-21 17:44:54,224 [WARNING] Duplicated values found in preranked stats: 0.02% of genes
The order of those genes will be arbitrary, which may produce unexpected results.
2024-03-21 17:44:56,031 [WARNING] Duplicated values found in preranked stats: 0.01% of genes
The order of those genes will be arbitrary, which may produce unexpected results.
2024-03-21 17:44:57,850 [WARNING] Duplicated values found in preranked stats: 0.01% of genes
The order of those genes will be arbitrary, which may produce unexpected results.
2024-03-21 17:45:00,560 [WARNING] Duplicated values found in preranked stats: 0.01% of genes
The order of those genes will be arbitrary, which may produce unexpected results.
2024-03-21 17:45:01,466 [WARNING] Duplicated values found in preranked stats: 0.01% of genes
The order of those genes will be arbitrary, which may produce unexpected results.
2024-03-21 17:45:03,230 [WARNING] Duplicated values found in preranked stats: 0.02% of genes
The order of those genes will be arbitrary, which may produce unexpected results.
2024-03-21 17:45:04,139 [WARNING] Duplicated values found in preranked stats: 0.02% of genes
The order of those genes will be arbitrary, which may produce unexpected results.
2024-03-21 17:45:05,034 [WARNING] Duplicated values found in preranked stats: 0.01% of genes
The order of those genes will be arbitrary, which may produce unexpected results.
2024-03-21 17:45:05,946 [WARNING] Duplicated values found in preranked stats: 0.01% of genes
The order of those genes will be arbitrary, which may produce unexpected results.
2024-03-21 17:45:06,814 [WARNING] Duplicated values found in preranked stats: 0.01% of genes
The order of those genes will be arbitrary, which may produce unexpected results.
2024-03-21 17:45:07,690 [WARNING] Duplicated values found in preranked stats: 0.01% of genes
The order of those genes will be arbitrary, which may produce unexpected results.
2024-03-21 17:45:08,560 [WARNING] Duplicated values found in preranked stats: 0.04% of genes
The order of those genes will be arbitrary, which may produce unexpected results.
2024-03-21 17:45:09,438 [WARNING] Duplicated values found in preranked stats: 0.01% of genes
The order of those genes will be arbitrary, which may produce unexpected results.

Heatmap GSEA results#

gsea_matrix_ct = pySCN.collect_gsea_results_from_dict(gsea_res_ct, .05)
pySCN.heatmap_gsea(gsea_matrix_ct, figsize=(10,12),clean_signatures = True, clean_cells=True)
../_images/23a730171b6a9720f20e64a80a1306d40c8e020cdb59be97779ef4c494a8585e.png