Metacell • spacedeconv

Building cell-type-specific expression signatures from annotated single-cell data can be time-consuming. To reduce resource requirements, spacedeconv provides access to metacell2. The software shrinks the dataset by computing robust “metacells”, each a mixture of similar cells from the original single-cell dataset. This approach can reduce the input size by a factor of about 100.

The following sections outline spacedeconv's metacell functions and integrate them into a workflow.

We use spacedeconv's sample data for this analysis. Since Metacell is Python-based, we need to convert the SingleCellExperiment to an AnnData object. The original dataset contains 5789 cells.

library(spacedeconv)
#> → checking spacedeconv environment and dependencies
#> Configuring package 'spacedeconv': please wait ...
#> Done!
library(SpatialExperiment)
data("single_cell_data_1")
single_cell_data_1
#> class: SingleCellExperiment 
#> dim: 29733 5789 
#> metadata(1): Samples
#> assays(1): counts
#> rownames(29733): RP11-34P13.7 FO538757.3 ... KRTAP9-2 IGLVIV-66-1
#> rowData names(1): ID
#> colnames(5789): CID4290A_AAACGGGAGACTGGGT CID4290A_AAAGTAGAGCGAAGGG ...
#>   CID4290A_TGCGTGGGTAGTAGTA CID4290A_TTTCCTCAGGCAGGTT
#> colData names(10): Sample Barcode ... celltype_minor celltype_major
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
ad <- spe_to_ad(single_cell_data_1) # convert to anndata

The main workflow consists of 2 mandatory and 2 optional functions:

1. Filter the dataset
2. (optional) Compute forbidden genes and gene modules
3. (optional) Extract forbidden genes from gene modules
4. Compute metacells

Clean Genes and Cells

The first step filters the dataset to remove low-quality genes and cells. It is also possible to remove genes manually and to set specific UMI cutoffs. For details, see the function's documentation.

filtered <- clean_genes_and_cells(ad)
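
To make the idea of UMI cutoffs concrete, here is a minimal base-R sketch on a toy count matrix. It is not the implementation behind clean_genes_and_cells; the matrix `toy_counts` and the thresholds `min_umis` and `max_umis` are hypothetical and chosen only for illustration.

```r
# Toy count matrix: 4 genes (rows) x 5 cells (columns); all values are made up.
toy_counts <- matrix(
  c(0, 0, 0, 0,   0,
    5, 1, 0, 9, 300,
    2, 0, 1, 4, 250,
    3, 1, 0, 7, 400),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5))
)

# Hypothetical cutoffs on the total UMI count per cell.
min_umis <- 3
max_umis <- 500

# Keep cells whose total UMI count lies within the cutoffs.
umis_per_cell <- colSums(toy_counts)
keep_cells <- umis_per_cell >= min_umis & umis_per_cell <= max_umis

# Drop genes that are not detected in any of the remaining cells.
keep_genes <- rowSums(toy_counts[, keep_cells, drop = FALSE]) > 0

toy_filtered <- toy_counts[keep_genes, keep_cells, drop = FALSE]
dim(toy_filtered)
```

In this toy example, cells with very few UMIs and the cell with an unusually high UMI count are removed, as is the gene that is never detected.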

(Optional) Compute Forbidden Genes

This function identifies genes that should not be used to build metacells, based on hardcoded gene patterns. It returns a list of genes that can be passed to the metacell computation step. In addition, gene modules are calculated and stored in the single-cell object, which is used in the third step of this workflow.

suspect_genes <- compute_forbidden_genes(filtered)
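
Pattern-based gene selection can be sketched in base R as follows. The patterns in `toy_patterns` (mitochondrial and ribosomal protein genes) are assumptions made for this illustration and are not necessarily the patterns hardcoded in spacedeconv; `gene_names` is a made-up example vector.

```r
# Example gene symbols (illustrative only).
gene_names <- c("MT-CO1", "RPL13", "RPS6", "RP11-34P13.7", "KRTAP9-2", "CD3E")

# Assumed regular expressions: mitochondrial genes and ribosomal protein genes.
toy_patterns <- c("^MT-", "^RP[LS][0-9]")

# A gene is flagged if it matches at least one of the patterns.
is_forbidden <- grepl(paste(toy_patterns, collapse = "|"), gene_names)
toy_forbidden <- gene_names[is_forbidden]
toy_forbidden
```

Note that "RP11-34P13.7" is not flagged, because the ribosomal pattern requires an "L" or "S" after "RP".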

(Optional) Extract Forbidden Genes from Gene Modules

This function extracts the genes from unsuitable gene modules. You only have to provide a list of unsuitable gene modules; the function returns a refined list of forbidden genes that can be used as input for the last step of this workflow.
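
The underlying idea can be sketched in base R under a few assumptions: the gene-to-module table `toy_modules`, the selected module ids and the starting gene list are all hypothetical, and merging the module genes with the earlier list via `union()` is an assumed behaviour, not a description of the spacedeconv function.

```r
# Hypothetical gene-to-module assignment; module ids and genes are made up.
toy_modules <- data.frame(
  gene = c("MT-CO1", "MT-ND1", "RPL13", "RPS6", "CD3E", "CD3D"),
  module = c(1, 1, 2, 2, 3, 3)
)

# Modules judged unsuitable after inspection (assumed choice).
unsuited_modules <- c(1, 2)

# Forbidden genes already known from the previous step (assumed).
toy_suspect_genes <- c("MT-CO1", "HSPA1A")

# Collect all genes from the unsuitable modules and merge them with the
# existing list, dropping duplicates.
module_genes <- toy_modules$gene[toy_modules$module %in% unsuited_modules]
toy_forbidden_genes <- union(toy_suspect_genes, module_genes)
toy_forbidden_genes
```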

TODO: plots and instructions will be added here.

Compute Metacells

This function uses the filtered single-cell data and, optionally, the forbidden genes to compute metacells. Since metacells do not carry a cell type annotation, they are reannotated based on the original single-cell data, using the cell type column you provide. It is also possible to set an abundance score to subset the metacells further. The abundance score quantifies the purity of a metacell, namely the fraction of cells in the metacell that belong to its most abundant cell type. Not every cell merged into a metacell has the same cell type in the original dataset. Here we keep only metacells with more than 90% purity, but other values can be used as well.

metacells <- compute_metacells(filtered, suspect_genes,
  cell_type_col = "celltype_major",
  abundance_score = 0.9
)

# load the precomputed result shipped with spacedeconv
metacells <- readRDS(system.file("extdata", "metacells.rds", package = "spacedeconv"))
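
The abundance score described above can be illustrated with a small base-R sketch. The table `toy_cells`, the helper `summarise_metacell()` and the variable `toy_threshold` are hypothetical, and the sketch assumes the score is expressed as a fraction between 0 and 1; it is not the code used inside compute_metacells.

```r
# Toy assignment of single cells to two metacells; all values are made up.
toy_cells <- data.frame(
  metacell = rep(c(0, 1), times = c(4, 10)),
  celltype = c("T", "T", "B", "T", rep("B", 10))
)

# Hypothetical helper: annotate one metacell with its most abundant cell
# type, the number of merged cells, and the fraction of cells of that type.
summarise_metacell <- function(types) {
  counts <- table(types)
  data.frame(
    celltype = names(counts)[which.max(counts)],
    grouped = length(types),
    percentage = max(counts) / length(types)
  )
}

per_metacell <- lapply(
  split(toy_cells$celltype, toy_cells$metacell),
  summarise_metacell
)
toy_annotation <- do.call(rbind, per_metacell)

# Keep only metacells whose purity exceeds the chosen threshold.
toy_threshold <- 0.9
toy_pure <- toy_annotation[toy_annotation$percentage > toy_threshold, ]
toy_pure
```

Here metacell 0 contains three T cells and one B cell (purity 0.75) and is discarded, while metacell 1 consists only of B cells and is kept.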

The result

The input dataset was reduced drastically in size and now contains 32 metacells with robust expression information. The new cell type annotation is stored in the “celltype” column. The column “grouped” contains the number of cells merged into each metacell, while the “percentage” column stores its abundance score.

metacells
#> class: SingleCellExperiment 
#> dim: 22299 32 
#> metadata(1): __name__
#> assays(3): counts scaled round
#> rownames(22299): RP11-34P13.7 FO538757.3 ... CTD-3222D19.10
#>   CTC-273B12.7
#> rowData names(5): excluded_gene clean_gene forbidden_gene
#>   pre_feature_gene feature_gene
#> colnames(32): 0 1 ... 31 32
#> colData names(5): grouped pile candidate celltype percentage
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
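
As a quick sanity check, the size reduction can be computed from the two dimensions printed in this document. The variable names are chosen for this sketch only, and the numbers are copied by hand from the output above rather than read from the objects.

```r
# Column counts copied from the printed objects above.
n_cells <- 5789     # cells in single_cell_data_1
n_metacells <- 32   # metacells remaining after purity filtering

# Fold reduction in the number of columns.
reduction_factor <- n_cells / n_metacells
round(reduction_factor)
```

For this dataset the number of columns shrinks roughly 181-fold; part of that reduction comes from discarding metacells below the purity threshold.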