Predictive Modelling Of Dataset-Specific Single-Cell RNA-Seq Pipeline Performance
Author(s): Cindy Fang, Alina Selega, Kieran R. Campbell
Affiliation(s): University of Toronto
The advent of single-cell RNA-sequencing (scRNA-seq) has driven the development of a plethora of computational methods for all analysis stages, including filtering, normalisation, and clustering. With many choices available to practitioners for each step in the analysis pipeline, selecting the optimal workflow can be a difficult task. Even an unrealistically simplistic example with only 3 analysis steps (e.g. filtering, normalisation, clustering), 4 methods for each step, and only 2 possible parameter combinations per method gives $(4 \times 2)^3 = 512$ possible pipelines. Given the far larger set of possibilities for steps, methods, and parameters, in practice the number of sensible pipelines that could be applied to scRNA-seq data is likely in the high thousands, if not millions. While multiple existing benchmark studies recommend the method that performs best on average for each step, the best choice ultimately depends on dataset characteristics. This leads to an interesting question: is it possible to predict how well a given pipeline will perform on a certain dataset? To begin to answer this, we have created a dataset consisting of the performance of 288 scRNA-seq clustering pipelines on 86 human datasets, quantifying performance using a variety of cluster purity metrics and gene set enrichment scores. Using this dataset, we build a series of supervised machine learning models that predict pipeline performance on an unseen dataset given pipeline parameters and dataset characteristics. We find that on unseen datasets with author-provided ‘ground truth’ labels, pipelines predicted to perform well produce clustering outputs that correlate significantly better with these labels than those of pipelines not predicted to perform well. Finally, by examining correlations of prediction performance with dataset-specific characteristics such as the number of genes, we identify which biological factors may impact the ability to predict pipeline performance.
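The pipeline count in the simplified example above can be checked by enumerating every combination explicitly. The following is a minimal sketch only: the step names come from the abstract, while the method and parameter indices are hypothetical placeholders and do not correspond to any real tool or to the 288 pipelines evaluated in the study.

```python
from itertools import product

# Step names are taken from the abstract's toy example.
steps = ["filtering", "normalisation", "clustering"]

# Assumed sizes from the toy example: 4 methods per step,
# 2 parameter combinations per method.
methods_per_step = 4
params_per_method = 2


def step_options(step):
    """Return every (step, method index, parameter index) choice for one step.

    The indices are placeholders, not real method or parameter names.
    """
    return [
        (step, method, params)
        for method in range(methods_per_step)
        for params in range(params_per_method)
    ]


# A pipeline picks exactly one (method, parameters) option for each step,
# so the set of pipelines is the Cartesian product of the per-step options.
pipelines = list(product(*(step_options(step) for step in steps)))
n_pipelines = len(pipelines)

# The enumeration agrees with the closed form (4 * 2) ** 3.
assert n_pipelines == (methods_per_step * params_per_method) ** len(steps)
print(n_pipelines)  # 512
```

Because the count is exponential in the number of steps, adding a single extra step with the same number of options would multiply it by 8, which illustrates why exhaustively testing every pipeline on every new dataset quickly becomes impractical.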