Contents

1 Introduction

Escort is a framework for evaluating a single-cell RNA-seq dataset’s suitability for trajectory inference and for quantifying trajectory properties influenced by analysis decisions. Escort is an R package designed to guide users through the trajectory inference process by offering goodness-of-fit evaluations for embeddings that represent a range of analysis decisions such as feature selection, dimension reduction, and trajectory inference method-specific hyperparameters.

If you encounter any issues, please report these to our Github: Report Issues

2 Installation

Escort can be installed via GitHub:

if (!requireNamespace("devtools", quietly=TRUE))
    install.packages("devtools")
devtools::install_github("xiaorudong/Escort")

The following packages are required for installing Escort. If installation fails, you may need to manually install the dependencies using the function ‘install.packages’ for CRAN packages or ‘BiocManager::install’ for Bioconductor packages. The error message will indicate which packages are missing, or you can go ahead and pre-install these.

CRAN packages: alphahull, fastICA, Rtsne, umap, rstatix, DT, FNN, grDevices, Hmisc, jmuOutlier, shotGroups, parallelDist, robustbase, scales, SCORPIUS, sfsmisc shiny, shinycssloaders, shinydashboard, shinyjs, shinyWidgets, dplyr

Bioconductor packages: SingleCellExperiment, SC3, slingshot, clusterProfiler, org.Hs.eg.db, org.Mm.eg.db, edgeR, limma

Bioconductor packages can be installed using:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("PACKAGE NAME")

CRAN packages can be installed using:

install.packages("PACKAGE NAME")

3 Example using a simulated single-cell RNA-seq dataset

Below we show an example using a simulated single-cell RNA-seq dataset from Saelens et al., 2019, available at (https://zenodo.org/records/1443566). For this example, we pre-normalized the data and removed genes having mean expression less than three. This simulated dataset is characterized by a linear topology structure. We will also set a seed here for reproducibility.

set.seed(11111)
library(Escort)
data("exampleData_linear")

Escort first assists users by determining the presence of a trajectory signal. This step provides information regarding the appropriateness of the dataset for trajectory analysis. Often trajectory analysis is not appropriate when cells represent distinct cell types or when cells are too homogeneous.

4 Step 1: Detecting Trajectory Existence

The input for this step are the matrices of raw and normalized single-cell RNA-seq data after any quality control. The HD_DCClusterscheck() function is utilized to identify diverse cell types, while the testHomogeneous() function aids in assessing homogeneity.

4.0.1 Assessing evidence of any distinct cell types:

In practice, we do not recommend setting K below, it was done here only for speed.

LvsC <- HD_DCClusterscheck(normcounts=norm_counts, rawcounts=rawcounts)
LvsC$DCcheck
## [1] "Congratulations! Escort did not find null spaces between clusters. Proceed to the homogeneity check step next."

4.0.2 Assessing evidence of cell homogeneity:

In practice, we recommend setting the number of iterations to at least 20. We set it to 5 here for speed.

cor_test <- step1_testHomogeneous(normcounts=norm_counts, num.sim=5)
cor_test$decision
## [1] "The trajectory signal is detected."

5 Step 2: Evaluating embedding-specific characteristics

Escort’s next step is to identify dataset-specific preferred embeddings for performing trajectory inference. An embedding encompasses all analysis or processing choices involved prior to trajectory inference (e.g. dimension reduction, feature selection, normalization, etc.)

For feature selection, usually the top highly variable genes are selected, although there is no pre-determined optimal number of genes to select. For dimension reduction, Escort has built-in the more well-known dimension reduction algorithms, specifically, UMAP, t-SNE, MDS, and PCA, through the getDR_2D() function.

Here we will evaluate an embedding consisting of selecting the top 20% of highly variable genes followed by PCA for dimension reduction. The modelGeneVar() function from scran package is used here to identify highly variable genes, although any procedure could be used.

gene.var <- quick_model_gene_var(norm_counts)
genes.HVGs <- rownames(gene.var)[1:1000] # (approx. 20%)

embedding1 <- getDR_2D(norm_counts[genes.HVGs,], "PCA")
head(embedding1)
##              PC1       PC2
## Cell_1 -3.388265 -4.557812
## Cell_2 -4.109442 -5.199206
## Cell_3 -4.486174 -3.002583
## Cell_4 -3.928438 -4.177772
## Cell_5 -2.948886 -3.293555
## Cell_6 -4.220721 -4.077668

5.0.1 Examining cell connectivity on embeddings

Escort first evaluates cell connectivity on the embedding using the LD_DCClusterscheck() function. Since the presence of distinct clusters w=has already been performed in the first step, then a reliable embedding should not exhibit distinct clusters. Thus, any embeddings found to be disconnected are classified immediately as Non-recommended.

DRLvsC <- LD_DCClusterscheck(embedding=embedding1)
DRLvsC$DCcheck
## [1] "Congratulations! Escort did not find null spaces between clusters. Proceed to the next evaluation step."

5.0.2 Examining preservation of cell relationships

Next, Escort assesses the effectiveness of the embedding at preserving inter-cellular relationships that were present in the high-dimensional data using the Similaritycheck() function. The output percentage represents the rate at which cells in the embedding successfully maintain inter-cellular relationships. A higher percentage indicates a greater preservation of relationships among cells.

simi_cells <- Similaritycheck(clusters=LvsC, dimred=embedding1)
simi_cells$GoodRate
## [1] 0.934

5.0.3 Examining cell density

Escort next evaluates the distribution of cells within the embedding. A tighter distribution of cells suggests an enhanced capacity to achieve higher-quality trajectory fits. In this context, we compute the cell coverage rate as a representation of cell density by applying GOFeval() function.

gof_eval <- GOFeval(embedding1)
gof_eval$occupiedRate
## [1] 0.353

6 Step 3: Quantifying Trajectory Fitting Performance

Now that the embedding has been evaluated independently, the final step involves assessing the embedding within the framework of a specific trajectory inference method to allow for evaluating method-specific hyperparameters. In this step, Escort will evaluate the combined embedding and inference method’s proportion of cells having an ambiguous projection to the preliminary trajectory.

Here, we use Slingshot for trajectory fitting:

library(slingshot)
library(mclust)
cls <- Mclust(embedding1)$classification
ti_out <- slingshot(data=embedding1, clusterLabels=cls)
rawpse <- slingPseudotime(ti_out, na=T)
ls_fitLine <- lapply(slingCurves(ti_out), function(x) x$s[x$ord,])

library(grDevices)
library(RColorBrewer)
colors <- colorRampPalette(brewer.pal(11,'Spectral')[-6])(100)
plotcol <- colors[cut(rawpse, breaks=100)]
plot(embedding1, col = plotcol, pch=16, cex=.5)
lines(SlingshotDataSet(ti_out), lwd=2, col='black')

6.0.1 Examining ambiguous cells

After fitting a trajectory, we use the prepTraj() function in Escort to generate an object that stores all relevant information about the embedding and trajectory. The fitted trajectory line is saved as segments between pairs of points within the prepTraj() function. Subsequently, the UshapeDetector() function is employed to calculate the proportion of ambiguous cells. For trajectories estimated to be smooth, such as those generated by Slingshot, it is recommended to use outlierdetect='neutral' in the UshapeDetector() function. Conversely, for convoluted trajectories like those produced by Monocle 3, it is suggested to use outlierdetect='asymmetric'. A smaller number of ambiguous cells indicates a more accurate estimation of pseudotime along the trajectory.

fitLine <- segFormat(ls_fitLine) # needed when using Slingshot to get the preliminary trajectory

resobj <- prepTraj(embedding1, PT=rawpse, fitLine=fitLine)

ushap_eval <- UshapeDetector(resobj)
ushap_eval$Ambpct
## [1] 0.006

7 Scoring System

Finally, a comprehensive score is computed to assess the overall performance of each embedding. This score encompasses all four components evaluated in Steps 2 and 3: cell connectivity (DCcheck), preservation of cell relationships (SimiRetain), cell density (GOF), and ambiguous cells (USHAPE). Results from these steps are stored in a data frame with columns: DCcheck, SimiRetain, GOF, and USHAPE. This data frame serves as input for the score_cal() function. Each embedding receives a score, indicating the recommended level for constructing a trajectory. Embeddings with a score greater than zero are reported as Recommended by Escort, while those with a score less than or equal to zero are considered Non-recommended.

scoredf <- data.frame(DCcheck=DRLvsC$ifConnected, SimiRetain=simi_cells$GoodRate,
                      GOF=gof_eval$occupiedRate, USHAPE=ushap_eval$Ambpct)
calcScore(scoredf)
##   Row.names DCcheck SimiRetain   GOF USHAPE Scaled_GOF Scaled_USHAPE score ranking              decision note
## 1         1    TRUE      0.934 0.353  0.006      0.334         0.793 1.053       1 Recommended Embedding   NA

8 SessionInfo

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-apple-darwin20
## Running under: macOS Ventura 13.6.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-3          mclust_6.1.1                slingshot_2.12.0            TrajectoryUtils_1.12.0      SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0 Biobase_2.64.0              GenomicRanges_1.56.1        GenomeInfoDb_1.40.1         IRanges_2.38.1              S4Vectors_0.42.1            BiocGenerics_0.50.0         MatrixGenerics_1.16.0       matrixStats_1.3.0           princurve_2.1.6             Escort_0.1.13               BiocStyle_2.32.1           
## 
## loaded via a namespace (and not attached):
##   [1] coin_1.4-3                R.methodsS3_1.8.2         SC3_1.32.0                DT_0.33                   Biostrings_2.72.1         TH.data_1.1-2             vctrs_0.6.5               spatstat.random_3.3-1     digest_0.6.36             png_0.1-8                 proxy_0.4-27              pcaPP_2.0-4               ggrepel_0.9.5             org.Mm.eg.db_3.19.1       deldir_2.0-4              dynparam_1.0.2            MASS_7.3-61               reshape2_1.4.4            httpuv_1.6.15             foreach_1.5.2             qvalue_2.36.0             withr_3.0.0               xfun_0.45                 ggfun_0.1.5               survival_3.7-0            doRNG_1.8.6               memoise_2.0.1             proxyC_0.4.1              clusterProfiler_4.12.0    gson_0.1.0                tidytree_0.4.6            zoo_1.8-12                pbapply_1.7-2             R.oo_1.26.0               DEoptimR_1.1-3            KEGGREST_1.44.1           promises_1.3.0            httr_1.4.7               
##  [39] rstatix_0.7.2             ps_1.7.7                  rstudioapi_0.16.0         UCSC.utils_1.0.0          generics_0.1.3            DOSE_3.30.1               processx_3.8.4            zlibbioc_1.50.0           sfsmisc_1.1-18            ggraph_2.2.1              polyclip_1.10-6           GenomeInfoDbData_1.2.12   SparseArray_1.4.8         xtable_1.8-4              stringr_1.5.1             desc_1.4.3                pracma_2.4.4              doParallel_1.0.17         evaluate_0.24.0           S4Arrays_1.4.1            hms_1.1.3                 bookdown_0.40             irlba_2.3.5.1             colorspace_2.1-0          SCORPIUS_1.0.9            jmuOutlier_2.2            ROCR_1.0-11               reticulate_1.38.0         spatstat.data_3.1-2       magrittr_2.0.3            readr_2.1.5               later_1.3.2               viridis_0.6.5             modeltools_0.2-23         ggtree_3.12.0             lattice_0.22-6            spatstat.geom_3.3-2       robustbase_0.99-3        
##  [77] shadowtext_0.1.3          cowplot_1.1.3             class_7.3-22              pillar_1.9.0              nlme_3.1-165              iterators_1.0.14          compiler_4.4.1            RSpectra_0.16-1           stringi_1.8.4             shinycssloaders_1.0.0     TSP_1.2-4                 plyr_1.8.9                crayon_1.5.3              abind_1.4-5               gridGraphics_0.5-1        locfit_1.5-9.10           sp_2.1-4                  graphlayouts_1.1.1        org.Hs.eg.db_3.19.1       bit_4.0.5                 sandwich_3.1-0            libcoin_1.0-10            dplyr_1.1.4               fastmatch_1.1-4           codetools_0.2-20          multcomp_1.4-25           openssl_2.2.0             bslib_0.7.0               e1071_1.7-14              mime_0.12                 splines_4.4.1             Rcpp_1.0.12               sparseMatrixStats_1.16.0  HDO.db_0.99.1             interp_1.1-6              knitr_1.48                blob_1.2.4                utf8_1.2.4               
## [115] WriteXLS_6.6.0            fs_1.6.4                  alphahull_2.5             DelayedMatrixStats_1.26.0 ggplotify_0.1.2           tibble_3.2.1              Matrix_1.7-0              statmod_1.5.0             tzdb_0.4.0                tweenr_2.0.3              pkgconfig_2.0.3           pheatmap_1.0.12           sgeostat_1.0-27           tools_4.4.1               cachem_1.1.0              RSQLite_2.3.7             viridisLite_0.4.2         DBI_1.2.3                 fastmap_1.2.0             rmarkdown_2.27            scales_1.3.0              grid_4.4.1                shinydashboard_0.7.2      broom_1.0.6               sass_0.4.9                patchwork_1.2.0           FNN_1.1.4                 BiocManager_1.30.23       carData_3.0-5             RANN_2.6.1                farver_2.1.2              tidygraph_1.3.1           scatterpie_0.2.3          yaml_2.3.9                cli_3.6.3                 purrr_1.0.2               lifecycle_1.0.4           askpass_1.2.0            
## [153] mvtnorm_1.2-5             backports_1.5.0           BiocParallel_1.38.0       gtable_0.3.5              umap_0.2.10.0             lmds_0.1.0                parallel_4.4.1            ape_5.8                   dynutils_1.0.11           limma_3.60.3              jsonlite_1.8.8            edgeR_4.2.1               ggplot2_3.5.1             shotGroups_0.8.2          bit64_4.0.5               assertthat_0.2.1          Rtsne_0.17                yulab.utils_0.1.4         spatstat.utils_3.0-5      ranger_0.16.0             RcppParallel_5.1.8        highr_0.11                jquerylib_0.1.4           GOSemSim_2.30.0           shinyjs_2.1.0             spatstat.univar_3.0-0     R.utils_2.12.3            rrcov_1.7-5               lazyeval_0.2.2            shiny_1.8.1.1             htmltools_0.5.8.1         enrichplot_1.24.0         GO.db_3.19.1              tinytex_0.51              glue_1.7.0                carrier_0.1.1             XVector_0.44.0            treeio_1.28.0            
## [191] dynwrap_1.2.4             RMTstat_0.3.1             gridExtra_2.3             boot_1.3-30               babelwhale_1.2.0          igraph_2.0.3              R6_2.5.1                  tidyr_1.3.1               CompQuadForm_1.4.3        cluster_2.1.6             rngtools_1.5.2            aplot_0.2.3               DelayedArray_0.30.1       tidyselect_1.2.1          ggforce_0.4.2             ash_1.0-15                car_3.1-2                 AnnotationDbi_1.66.0      fastICA_1.2-4             munsell_0.5.1             KernSmooth_2.23-24        splancs_2.01-45           data.table_1.15.4         htmlwidgets_1.6.4         fgsea_1.30.0              rlang_1.1.4               remotes_2.5.0             fansi_1.0.6               parallelDist_0.2.6