Prepare reference data#

Summary#

Below is a brief tutorial that shows you how we prepared a scRNA-seq data set so that it is ready to be used to train a PySCN classifier. In general, any scRNA-seq data can be used as training data, as long as it is carefully annotated and the raw expression counts are maintained. See Reference data to browse other reference data sets that we have compiled. In this tutorial, we assume that you have already installed PySCN. We use the terms “reference” and “training” interchangebly.

Training data#

We use the “10k PBMCs from a Healthy Donor (v3 chemistry) Single Cell Gene Expression Dataset by Cell Ranger 3.0.0” data set from 10X Genomics.

Load packages and raw data#

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import scanpy as sc
import scipy as sp
import numpy as np
import anndata
import pySingleCellNet as pySCN
import igraph as ig
import altair as alt
from joblib import dump, load
import sys

sc.settings.verbosity = 3 
sc.logging.print_header()
scanpy==1.9.3 anndata==0.8.0 umap==0.5.3 numpy==1.23.5 scipy==1.10.1 pandas==2.0.0 scikit-learn==1.2.2 statsmodels==0.13.5 python-igraph==0.9.11 pynndescent==0.5.8

Load 10k PBMC data

ad10f = sc.read_10x_h5("../../dat/pbmc/pbmc_10k_v3_filtered_feature_bc_matrix.h5")
ad10f.var_names_make_unique()
ad10f
reading ../../dat/pbmc/pbmc_10k_v3_filtered_feature_bc_matrix.h5
 (0:00:00)
/opt/homebrew/Caskroom/miniforge/base/envs/test1/lib/python3.10/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
AnnData object with n_obs × n_vars = 11769 × 33538
    var: 'gene_ids', 'feature_types', 'genome'

Quality control#

Start the QC process. pySCN.mito_rib computes the proportion of counts derived from mitochondrially-encoded genes and ribosomal genes. By default, it will also remove these genes from the annData object.

adstart = ad10f.copy()
adstart = pySCN.mito_rib(adstart, species='HS', clean = True)
adstart
AnnData object with n_obs × n_vars = 11769 × 33421
    obs: 'n_genes_by_counts', 'total_counts', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_mt', 'pct_counts_mt'
    var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8,4), gridspec_kw={'wspace':0.25})
ax1_dict = sc.pl.scatter(adstart, x='total_counts', y='pct_counts_mt', ax=ax1, show=False)
ax2_dict = sc.pl.scatter(adstart, x='total_counts', y='n_genes_by_counts',ax=ax2, show=False)
plt.show()
../_images/aeea371584f481805d683016504bbd2a5dd205becdc6e710ccdb2b08ac424fcb.png

Let’s keep cells that …

  • mt% < 20

  • n_genes >= 500

  • n_counts <= 30,000

And exclude genes that are detected in 2 or fewer cells

adstart = adstart[adstart.obs['pct_counts_mt']<20,:].copy()
adstart.n_obs
sc.pp.filter_cells(adstart, min_genes=500)
adstart.n_obs
sc.pp.filter_cells(adstart, max_counts=30000)
adstart.n_obs
sc.pp.filter_genes(adstart, min_cells=3)
adstart.n_vars
filtered out 139 cells that have less than 500 genes expressed
filtered out 7 cells that have more than 30000 counts
filtered out 13317 genes that are detected in less than 3 cells
20104

Normalize and reduce dimentionality#

Normalize expression, detect highly variable genes (HVG), and run PCA.

Note

pySCN.norm_hvg_scale_pca does not scale expression values by default. Set gene_scale = True if you want to scale gene expression.

adNorm = pySCN.norm_hvg_scale_pca(adstart)
normalizing counts per cell
    finished (0:00:00)
extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
computing PCA
    on highly variable genes
    with n_comps=100
    finished (0:00:03)

Plot variance ratios

plt.rcParams["figure.figsize"] = (8,4)
sc.pl.pca_variance_ratio(adNorm, 50)
../_images/6fa6c450c419a51ce61bb30b00c0dc3c75fe0e17a5e7e781cc24179c033a1b25.png

Embed data and cluster#

Embed with UMAP based on first 11 PCs, and see how QC metrics track with clusters.

npcs = 11
sc.pp.neighbors(adNorm, n_neighbors=20, n_pcs=npcs)
sc.tl.leiden(adNorm,.1)

sc.tl.paga(adNorm)
sc.pl.paga(adNorm, plot=False)
sc.tl.umap(adNorm, 0.25, init_pos='paga')

plt.rcdefaults()
#plt.subplots(layout="constrained")
sc.pl.umap(adNorm,color=['n_counts', 'n_genes', 'pct_counts_mt', 'pct_counts_ribo'], alpha=.75, s=10)
computing neighbors
    using 'X_pca' with n_pcs = 11
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:05)
running Leiden clustering
    finished: found 9 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)
running PAGA
    finished: added
    'paga/connectivities', connectivities adjacency (adata.uns)
    'paga/connectivities_tree', connectivities subtree (adata.uns) (0:00:00)
--> added 'pos', the PAGA positions (adata.uns['paga'])
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:04)
../_images/1077bd7a5611ff3f39855c2bfae8f51121d60b5ca8dcd5c4fb299dd3ce94e084.png

Cluster annotation#

Manually annotate the cells based on established markers of various PBMC cell types.

marker_genes_dict = {
    'B cell': ['CD79A', 'MS4A1', "PAX5"],
    'Dendritic': ['FCER1A', "FLT3","CD1C"],
    'CD14 monocyte': ['CD14', "CD36"],
    'FCGR3A monocyte': ['FCGR3A', 'MS4A7'],
    'NK cell': ['GNLY',"KLRD1", "PRF1"],
    'Megakaryocyte': ['PPBP',"PF4", "GNG11"],
    'CD4 T cell': ['CD3D', 'IL7R', "CD3E"],
    'CD8 T cell': ['CD8A', "EOMES"]
}
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,4), gridspec_kw={'wspace':0.25})
ax1_dict = sc.pl.umap(adNorm,color=['leiden'], alpha=.75, s=10, legend_loc='on data', ax=ax1, show=False)
ax2_dict = sc.pl.dotplot(adNorm, marker_genes_dict, 'leiden', dendrogram=True,ax=ax2, show=False)
plt.show()
WARNING: dendrogram data not found (using key=dendrogram_leiden). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
    using 'X_pca' with n_pcs = 100
Storing dendrogram info using `.uns['dendrogram_leiden']`
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: B cell, Dendritic, CD14 monocyte, etc.
../_images/0882ead9d90a7eadf5b728a9699ceab96adfc14531616aa73137eae60531173a.png

Based on the expression of these marker genes, here is an initial guess at their identities:

  • 6: dendritic, but split more because it looks like there might be some monocytes in there

  • 1: CD14+ monocyte

  • 4: split more (dual monocyte type) or doublet

  • 2: b-cell possible pre/pro

  • 7: split further (dual T- B-cell) or doublet

  • 8: megakaryocyte

  • 0: CD4+ T-cell

  • 3: CD8+ T-cell

  • 5: NK cell

Since some of the clusters might contain mixtures of cell types, or are doublets, try to fractionate them further.

sc.tl.leiden(adNorm,.15, restrict_to=["leiden",["4", "6", "7"]])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,4), gridspec_kw={'wspace':0.25})
ax1_dict = sc.pl.umap(adNorm,color=['leiden_R'], alpha=.75, s=10, legend_loc='on data', ax=ax1, show=False)
ax2_dict = sc.pl.dotplot(adNorm, marker_genes_dict, 'leiden_R', dendrogram=True,ax=ax2, show=False)
plt.show()
running Leiden clustering
    finished: found 13 clusters and added
    'leiden_R', the cluster labels (adata.obs, categorical) (0:00:00)
WARNING: dendrogram data not found (using key=dendrogram_leiden_R). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
    using 'X_pca' with n_pcs = 100
Storing dendrogram info using `.uns['dendrogram_leiden_R']`
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: B cell, Dendritic, CD14 monocyte, etc.
../_images/94b188da4ef1949482638e453cc2a5692124682e9461c3473b69c73e302b3e93.png

Based on these results, our final cluster annotation is as follows:

  • 4-6-7,2: dendritic

  • 4-6-7,0: FCGR3A+ monocyte

  • 4-6-7,5: doublet

  • 1: CD14+ monocyte

  • 4-6-7,1:doublet

  • 2: B cell

  • 4-6-7,3: doublet

  • 4-6-7,4: doublet

  • 4-6-7,6: doublet

  • 8: Megakaryocyte

  • 0: CD4 T cell

  • 3: CD8 T cell

  • 5: NK cell

Leaving 8 legitimate clusters. So let’s remove the clusters that seem to be doublets.

tokeep = ["4-6-7,2","4-6-7,0", "1", "2", "8", "0", "3", "5"]
adClean2 = adstart.copy()
adClean2 = adClean2[adNorm.obs['leiden_R'].isin(tokeep)].copy()
adClean2
AnnData object with n_obs × n_vars = 10309 × 20104
    obs: 'n_genes_by_counts', 'total_counts', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_mt', 'pct_counts_mt', 'n_genes', 'n_counts'
    var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'
adNorm2 = pySCN.norm_hvg_scale_pca(adClean2)
adNorm2.obs['cluster'] = adNorm[adNorm2.obs.index].obs['leiden_R'].copy()
npcs = 11
sc.pp.neighbors(adNorm2, n_neighbors=20, n_pcs=npcs)
sc.tl.leiden(adNorm2,.1)
sc.tl.paga(adNorm2)
sc.pl.paga(adNorm2, plot=False)
sc.tl.umap(adNorm2, 0.25, init_pos='paga')
normalizing counts per cell
    finished (0:00:00)
extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
computing PCA
    on highly variable genes
    with n_comps=100
    finished (0:00:03)
computing neighbors
    using 'X_pca' with n_pcs = 11
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
running Leiden clustering
    finished: found 8 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)
running PAGA
    finished: added
    'paga/connectivities', connectivities adjacency (adata.uns)
    'paga/connectivities_tree', connectivities subtree (adata.uns) (0:00:00)
--> added 'pos', the PAGA positions (adata.uns['paga'])
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:04)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8,4), gridspec_kw={'wspace':0.25})
ax1_dict = sc.pl.umap(adNorm2,color=['cluster'], alpha=.75, s=10, legend_loc='on data', ax=ax1, show=False)
ax2_dict = sc.pl.umap(adNorm2,color=['leiden'], alpha=.75, s=10, legend_loc='on data', ax=ax2, show=False)
plt.show()
../_images/b4f141d9fc16d6c9234580db18e09f243c4d29c757d2d8e6aacdfb31c184e554.png
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,4), gridspec_kw={'wspace':0.25})
ax1_dict = sc.pl.umap(adNorm2,color=['leiden'], alpha=.75, s=10, legend_loc='on data', ax=ax1, show=False)
ax2_dict = sc.pl.dotplot(adNorm2, marker_genes_dict, 'leiden', dendrogram=True,ax=ax2, show=False)
plt.show()
WARNING: dendrogram data not found (using key=dendrogram_leiden). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
    using 'X_pca' with n_pcs = 100
Storing dendrogram info using `.uns['dendrogram_leiden']`
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: B cell, Dendritic, CD14 monocyte, etc.
../_images/e506b0ffb2d49829517b15f7ef7970b4dcd14cb371e4a61ec8ff50d5f6fe9f4f.png

Add the final cell annotations to the annData object:

  • 6: Dendritic

  • 1: CD14+ monocyte

  • 5: FCGR3A+ monocyte

  • 2: B cell

  • 7: Megakaryocyte

  • 0: CD4 T cell

  • 3: CD8 T cell

  • 4: NK cell

cell_dict = {'Dendritic': ['6'],
             'CD14 monocyte': ['1'],
             'FCGR3A monocyte': ['5'],
             'B cell': ['2'],
             'Megakaryocyte': ['7'],
             'CD4 T cell': ['0'],
             'CD8 T cell': ['3'],
             'NK cell': ['4']
}
new_obs_name = 'cell_type'
adNorm2.obs[new_obs_name] = np.nan

for i in cell_dict.keys():
    ind = pd.Series(adNorm2.obs.leiden).isin(cell_dict[i])
    adNorm2.obs.loc[ind,new_obs_name] = i

adNorm2.obs['cell_type'] = adNorm2.obs['cell_type'].astype("category") 
sc.tl.dendrogram(adNorm2, "cell_type")
    using 'X_pca' with n_pcs = 100
Storing dendrogram info using `.uns['dendrogram_cell_type']`
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,5), gridspec_kw={'wspace':0.4})
ax1_dict = sc.pl.umap(adNorm2,color=['cell_type'], alpha=.75, s=10, legend_loc='on data', ax=ax1, show=False)
ax2_dict = sc.pl.dotplot(adNorm2, marker_genes_dict, 'cell_type', dendrogram=True,ax=ax2, show=False)
plt.show()
../_images/984cdd097b6d6336c55c9bb92c0b4dfcdf5799c19bc900ccebe39a4b172f7993.png
adNorm2.obs['cell_type'].value_counts()
cell_type
CD4 T cell         3554
CD14 monocyte      3128
B cell             1450
CD8 T cell         1029
NK cell             608
FCGR3A monocyte     327
Dendritic           154
Megakaryocyte        59
Name: count, dtype: int64

Add cell_type annotation to cleaned raw data and save for use as reference data.

adTrainPBMC = adClean2.copy()
adTrainPBMC.obs['cell_type'] = adNorm2.obs['cell_type'].copy()
adTrainPBMC.write_h5ad("adPBMC_ref_040623.h5ad")