SCALABLE GAUSSIAN PROCESS MODELLING OF COUNTS DATA FROM RNA-SEQUENCING EXPERIMENTS
Abstract
The negative binomial distribution has been shown a good model for counts data
from both bulk and single-cell RNA-sequencing (RNA-seq). Gaussian process (GP)
regression provides a useful non-parametric approach for modeling temporal or spatial
changes in gene expression. However, currently available GP regression methods that
implement negative binomial likelihood models do not scale to the increasingly large
datasets being produced by single-cell technologies.
The GPcounts package implements GP regression methods for modelling counts
data using negative binomial and zero-inflated negative binomial likelihood functions.
Computational efficiency is achieved through the use of variational Bayesian inference.
The GP function is used to model changes in the mean of the negative binomial
likelihood through a logarithmic link function and the dispersion parameter is fitted
by maximum likelihood. We also provide the option of modelling additional dropout
using a zero-inflated negative binomial likelihood where the gene-specific drop-out
rate is linked to the inferred GP function through a Michaelis-Menten function. We
validate the method on simulated time course data, showing that it is better able to
identify changes in over-dispersed counts data over time than methods based on Gaussian
or Poisson likelihoods. To demonstrate temporal inference, we apply GPcounts
to single-cell RNA-seq datasets after pseudotime and branching inference. To demonstrate
spatial inference, we apply GPcounts to data from the mouse olfactory bulb to
identify spatially variable genes and we compare our results to those from a published
GP method with a Gaussian likelihood function. Our results show that the GPcounts
package can be used to model temporal, spatial, branching and transition in counts data
in cases where simpler Gaussian and Poisson likelihoods are unrealistic.