YL and CV for OBS

Introduction

MAYA (Multimodes of pathwAY Activation) enables comprehensive pathway study and automated cell type prediction thanks to multimodal activity scoring of gene lists (pathways from MSigDB or markers from PanglaoDB for instance) in individual cells.

MAYA provides functions to build a multimodal pathway activity matrix, compute UMAP to visualize cells in 2 dimensions based on this activity matrix, visualize the different modes of activation of a pathway in this space and also check the expression of top contributing genes of each mode in the dataset through heatmaps.

MAYA also provides a function to predict cell type based on PanglaoDB or any custom cell type markers list provided by the user. Cells can also be visualized in embeddings computed based on the marker activity matrix. MAYA is not run in a multimodal way for cell type prediction for interpretability purposes.

Installation

This package requires R >= 4.0.5 to run and can be installed using devtools:

install.packages("devtools")
devtools::install_github("one-biosciences/maya")

Use case: kidney dataset

Through this use case, we will demonstrate the insight that brings multimodal pathway activity analysis to apprehend the cellular heterogeneity of a dataset. We will show how to use the different functions available in this package to run a full analysis of the activity of MSigDB HALLMARK gene sets in normal kidney cells. We will also demonstrate how to use the integrated MAYA function to automatically predict cell types.

This example dataset was built using a count matrix and corresponding metadata made available by Young et al.(2018). In their paper, the authors thoroughly investigated the cellular identity of normal and cancerous human kidneys from 72,501 single-cell transcriptomes, which led to the identification of various cell types and subtypes of normal and cancerous immune and non-immune cell types.

The matrix was filtered to select only protein coding genes and keep only 5 groups of normal cells:

  • AV2 (Vascular endothelium – ascending vasa recta)
  • G (Glomerular epithelium - podocytes)
  • M (Mesangial cells - myofibroblasts)
  • 8T (CD8 T cells)
  • MNP1 (Mononuclear phagocytes)

Loading data

library(MAYA)

First, load the example count matrix available in the package:

path<-system.file("extdata", "kidney_count_matrix.tsv", package = "MAYA")
count_mat<-Matrix::Matrix(as.matrix(read.table(file=path,sep="\t",header=T,check.names = F)),sparse=TRUE)
dim(count_mat)
## [1] 19720  1252

The number of cells was intentionally chosen small to ensure reasonable running time for this example.

Then, load the associated metadata:

path<-system.file("extdata", "kidney_metadata.tsv", package = "MAYA")
meta<-read.table(file=path,sep="\t",header=T)
dim(meta)
## [1] 1252    5
colnames(meta)
## [1] "nCount_RNA"   "nFeature_RNA" "Alias"        "Cell_type1"   "Cell_type2"

We have several variables available to characterize our cells:

  • “nCount_RNA” (number of molecules counted)
  • “nFeature_RNA” (number of genes expressed)
  • “Alias” (short cell subtype name)
  • “Cell_type1” (cell type description)
  • “Cell_type2” (cell subtype description)

“Alias” is the annotation that will be preferred to describe our cells as it corresponds to the annotation of the 5 groups described above.

table(meta$Alias)
## 
##   8T  AV2    G    M MNP1 
##  155  305  259  118  415

Running MAYA_pathway_analysis

The user simply needs a raw count matrix and to specify in modules_list either “hallmark” or “kegg” to load prebuilt versions of the corresponding MSigDB pathway lists, or provide any list of pathways with their associated markers (they can be loaded from gmt files with the function MAYA::read_gmt).

activity_summary<-MAYA_pathway_analysis(expr_mat=count_mat,
                                        modules_list = "hallmark",
                                        is_logcpm=F)
## Running pathway analysis

## Loading HALLMARK from MSigDB

## Computing gene sets activity

## Found at least one informative activation mode for 20 gene sets

## 'as(<dsCMatrix>, "dgTMatrix")' is deprecated.
## Use 'as(as(., "generalMatrix"), "TsparseMatrix")' instead.
## See help("Deprecated") and help("Matrix-deprecated").

The result is a list containing the main fields:

  • activity_matrix: mode by cell matrix
  • umap: activity-based embeddings for result exploration
  • PCA_obj: detailed result of activity analysis by pathway, required for some plotting functions

The user can then visualize the activity matrix as a heatmap:

plot_heatmap_activity_mat(activity_mat = activity_summary$activity_matrix, 
                          meta = meta, 
                          annot_name = "Alias")

The user can also choose to scale each mode between 0 and 1, which sometimes help homogenize the visualization when some modes have high activation compared with others.

plot_heatmap_activity_mat(activity_mat = scale_0_1(activity_summary$activity_matrix), 
                          meta = meta, 
                          annot_name = "Alias")

The user can also check how cells from different cell types organize in MAYA embeddings based on hallmark pathways activity:

plot_umap_annot(umap=activity_summary$umap,
                labels = meta$Alias,
                title = "Author annotation - HALLMARK")

The user can then visualize multimodal activity of a specific pathway on the UMAP - the allograft rejection for instance - and check the expression of top 10 contributing genes for each mode. For this purpose, the user might want to plot the logCPM counts and therefore normalize the matrix at this stage.

plot_umap_pathway_activity(umap=activity_summary$umap,
                           PCA_object = activity_summary$PCA_obj,
                           module = "HALLMARK_ALLOGRAFT_REJECTION")

logcpm<-logcpmNormalization(count_mat)
plot_heatmap_pathway_top_contrib_genes(expr_mat=logcpm,
                                       PCA_object = activity_summary$PCA_obj,
                                       module = "HALLMARK_ALLOGRAFT_REJECTION",
                                       n = 10,
                                       meta = meta,
                                       annot_name = "Alias")

The expression of top contributing genes can also be plotted on the UMAP.

plot_umap_gene(umap=activity_summary$umap,
               expr_mat = logcpm,
               gene = "CTSS")

Finally, the user can compute and represent the specificity of each mode in the different cell types.

plot_pathway_specificity(PCA_object = activity_summary$PCA_obj,
                         module = "HALLMARK_ALLOGRAFT_REJECTION",
                         meta = meta,
                         annot_name = "Alias")

Running MAYA_predict_cell_type

To run the cell type prediction module in the simplest way, one only needs a raw count matrix, and the algorithm will load the full PanglaoDB to try to identify cell types. However some parameters might be important to have in mind like min_cells_pct which corresponds to the minimal fraction of cells you expect a cell type to represent in the whole dataset (default is 5%), organs to directly specify the tissue you are working on to avoid possible interferences between similar cell types located in different organs, is_logcpm that should be set to true if one already performed a custom data normalization, nCores to speed up the process if possible and compute_umap to disable this option that can take more time as the size of the dataset increases.

# activity_summary<-MAYA_predict_cell_types(expr_mat = count_mat)
activity_summary<-MAYA_predict_cell_types(expr_mat = count_mat,
                                    min_cells_pct = 0.05,
                                    organs = "Kidney",
                                    is_logcpm = FALSE,
                                    nCores = 1,
                                    compute_umap = T)
## Running cell type identification

## Loading PanglaoDB

## Computing gene sets activity

## Found at least one informative activation mode for 35 gene sets

## Computing UMAP on activity matrix

The result is a list containing the main fields:

  • cell_annotation: vector with cell type for each cell or “unassigned”
  • activity_matrix: mode by cell matrix
  • umap: activity-based embeddings for result exploration

The user can visualize cell annotation in MAYA embeddings, compare it with the authors annotation and visualize the underlying activity matrix.

plot_umap_annot(umap=activity_summary$umap,
                labels = activity_summary$cell_annotation,
                title = "MAYA cell type prediction")

plot_umap_annot(activity_summary$umap,labels = meta$Alias,title = "Author annotation")

plot_heatmap_activity_mat(activity_mat = activity_summary$activity_matrix, 
                          meta = meta, 
                          annot_name = "Alias")

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] MAYA_1.0.1        leidenbase_0.1.30 viridis_0.6.5     viridisLite_0.4.2
##  [5] wesanderson_0.3.7 pheatmap_1.0.12   RANN_2.6.1        igraph_2.0.3     
##  [9] umap_0.2.10.0     ggplot2_3.5.1     dplyr_1.1.4       Matrix_1.7-0     
## [13] rmarkdown_2.28   
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.5       jsonlite_1.8.8     highr_0.11         compiler_4.4.1    
##  [5] tidyselect_1.2.1   Rcpp_1.0.13        gridExtra_2.3      scales_1.3.0      
##  [9] png_0.1-8          yaml_2.3.10        fastmap_1.2.0      reticulate_1.38.0 
## [13] lattice_0.22-6     R6_2.5.1           labeling_0.4.3     generics_0.1.3    
## [17] knitr_1.48         tibble_3.2.1       munsell_0.5.1      openssl_2.2.1     
## [21] RColorBrewer_1.1-3 pillar_1.9.0       rlang_1.1.4        utf8_1.2.4        
## [25] xfun_0.47          cli_3.6.3          withr_3.0.1        magrittr_2.0.3    
## [29] digest_0.6.37      grid_4.4.1         askpass_1.2.0      lifecycle_1.0.4   
## [33] vctrs_0.6.5        RSpectra_0.16-2    evaluate_0.24.0    glue_1.7.0        
## [37] farver_2.1.2       fansi_1.0.6        colorspace_2.1-1   tools_4.4.1       
## [41] pkgconfig_2.0.3    htmltools_0.5.8.1