Optimizing signal and correcting for between-cell-type biases in heterogenous spatial and single-cell RNA-seq

Author(s): Jared T Brown, Lingxin Cheng, Dylan Cable, Zijian Ni, Chitrasen Mohanty, Matthew Bernstein, Christina Kendziorski Newton, Rafael Irizarry

Affiliation(s): Department of Data Science, Dana Farber Cancer Institute



Proper normalization is an integral step in any RNA-seq preprocessing or analysis pipeline. While normalization methods are well studied for older sequencing technologies, more recent technologies still present significant challenges. Notably, the cell-type and tissue heterogeneity captured in these datasets has increased dramatically, both in single-cell and especially in spatial sequencing. When differences between cell types are driven by a small number of highly expressed genes, the estimated normalization factors can be biased, leading to excess false positives in downstream analysis. Further, the sparsity of these data confounds methods that aim to account for these effects, at best reducing the precision of the estimated factors below what could otherwise be achieved. Here, we describe Rhino (robust heterogeneity integration and normalization), a largely automated model of observed gene expression across cells/samples. Rhino solves the normalization problem by using observed expression across all genes to estimate corrective factors within each cell type and by implicitly restricting to the set of genes that are equivalently expressed across cell types. Because Rhino internally models cell-type heterogeneity as a reduced-dimensional latent space, it resolves the question of whether to cluster or normalize first. Further, Rhino uses a GLM framework to identify which genes vary across cell types, allowing both the method and the user to identify significantly variable genes based on statistical evidence rather than heuristics. We demonstrate the advantages and capabilities of Rhino on simulated and case-study datasets.
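
The bias described above can be illustrated with a small simulation. The sketch below is not Rhino and does not reproduce its model; it is a minimal illustration, assuming Poisson counts, two cell types with identical true scaling, and a hypothetical set of 20 marker genes that are 50-fold higher in one type. The helper `size_factors` and all parameter values are assumptions made for the example. Total-count size factors computed from all genes differ between the two types even though no technical difference exists, while factors computed only from the equivalently expressed genes do not. Here the stable gene set is known by construction, whereas in real data it would have to be inferred.

```python
import numpy as np

def size_factors(counts, genes=None):
    """Total-count size factors per cell, scaled to have mean 1.

    counts: (cells x genes) array of counts.
    genes:  optional index array restricting which genes are summed.
    """
    sub = counts if genes is None else counts[:, genes]
    totals = sub.sum(axis=1).astype(float)
    return totals / totals.mean()

rng = np.random.default_rng(0)
n_genes, n_cells = 2000, 200

# Baseline mean expression shared by both cell types (assumed gamma-distributed).
base = rng.gamma(shape=2.0, scale=1.0, size=n_genes)

# Hypothetical marker genes: 50-fold higher in cell type B, unchanged elsewhere.
marker = np.arange(20)
mu_a = base.copy()
mu_b = base.copy()
mu_b[marker] *= 50.0

# Half the cells are type A (label 0), half type B (label 1).
labels = np.repeat([0, 1], n_cells // 2)
mu = np.where(labels[:, None] == 1, mu_b, mu_a)  # shape (cells, genes)

# Every cell has the same true scaling factor, so any between-type
# difference in estimated size factors is bias, not signal.
counts = rng.poisson(mu)

naive = size_factors(counts)
stable = np.setdiff1d(np.arange(n_genes), marker)
restricted = size_factors(counts, stable)

naive_ratio = naive[labels == 1].mean() / naive[labels == 0].mean()
restricted_ratio = restricted[labels == 1].mean() / restricted[labels == 0].mean()

print(f"B/A size-factor ratio, all genes:    {naive_ratio:.3f}")
print(f"B/A size-factor ratio, stable genes: {restricted_ratio:.3f}")
```

Dividing by the all-gene factors would shrink every non-marker gene in type B relative to type A, which is how a handful of highly expressed genes can produce spurious differential expression downstream.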