SCArray.sat – Large-scale single-cell RNA-seq data analysis using GDS files and Seurat


Author(s): Xiuwen Zheng, Damian Stichel, Alice Wan

Affiliation(s): Genomics Research Center, AbbVie Inc., 1 North Waukegan Rd., North Chicago, IL 60064, US



Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of gene expression heterogeneity within complex biological systems. As scRNA-seq technology becomes more accessible and cost-effective, experiments are generating data from ever larger numbers of cells. Analyzing such large datasets remains a challenge: although numerous tools have been developed to handle the complexities of scRNA-seq data, their limited scalability is a major bottleneck for large-scale experiments. In particular, the R package Seurat is one of the most widely used tools for exploring and analyzing scRNA-seq data, but the size of the datasets it can handle is limited by available memory. To address this issue, we introduce a new R package called “SCArray.sat” that extends the Seurat classes and functions to support Genomic Data Structure (GDS) files as a DelayedArray backend for data representation. GDS files store multiple dense and sparse array-based data sets in a hierarchical structure. The package defines a new class, “SCArrayAssay” (derived from the Seurat class “Assay”), which wraps the raw count, normalized expression, and scaled data matrices as GDS-specific DelayedMatrix objects. It is designed to integrate seamlessly with Seurat, supporting common analysis workflows with algorithms optimized for GDS data files. We demonstrate the utility and multi-core performance of SCArray.sat using both real and simulated large datasets, as well as its integration with existing Bioconductor frameworks. Compared to Seurat, SCArray.sat significantly reduces memory usage and can be applied to ultra-large datasets.