Nullranges: Modular Workflow For Overlap Enrichment
Author(s): Wancen Mu, Eric Scott Davis, Mikhail Dozmorov, Stuart Lee, Michael I Love, Douglas Phanstiel
Affiliation(s): University of North Carolina at Chapel Hill
There are many well-established packages for overlap enrichment in R/Bioconductor. These can be used to test whether two sets of genomic ranges lie closer to each other than expected under a particular null hypothesis. In this software demo we focus on two branches of null hypothesis specification for the distribution of genomic ranges, in which we find it beneficial to separate the generation of null ranges from the enrichment analysis itself. In these cases the specification of the null hypothesis is complex enough to deserve its own steps, diagnostic considerations, and plots, all of which are covered in our workshop. Finally, we will demonstrate how nullranges plays a role in a tidy data workflow, tying together multiple Bioconductor and tidyverse packages.

The first branch of null hypothesis specification allows users to generate matched ranges that control for specific confounding characteristics. Since the distribution of these characteristics often differs between the set of interest (the focal set) and the pool of candidate ranges, an appropriate null set must be selected from the pool so that its covariate distribution matches that of the focal set. We have implemented a propensity score-based method for performing covariate-matched subset selection. Our implementation is efficient on genome-scale data and tightly integrated with existing Bioconductor classes. Additionally, we provide accessor methods and plotting functions for visualizing and assessing matching quality and covariate balance.

The second branch is to perform bootstrap resampling on blocks of the genome containing an original set of ranges, preserving the ranges' clustering properties and optionally respecting an exclusion list of regions where ranges should not be placed. The algorithm follows the genomic block bootstrap (Bickel et al. 2010). Our implementation uses efficient vectorized code for generating bootstrap ranges from input GRanges objects.
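
The idea of block bootstrapping can be illustrated with a minimal sketch. It is written in plain Python purely for illustration and is not the nullranges implementation or API: the function name `block_bootstrap`, the single linear "genome", the half-open `(start, end)` tuples, and the rule of keeping only ranges fully inside a sampled block are all simplifying assumptions. Exclusion lists and segmentation are not handled here.

```python
import random


def block_bootstrap(ranges, genome_length, block_length, rng):
    """Resample blocks of one linear genome with replacement (toy version).

    ranges: list of half-open (start, end) intervals, 0-based.
    Returns the ranges carried along by the sampled blocks, relocated so
    that the blocks tile the genome from left to right.
    """
    if block_length > genome_length:
        raise ValueError("block_length must not exceed genome_length")
    boot = []
    n_blocks = -(-genome_length // block_length)  # ceiling division
    for b in range(n_blocks):
        dest_start = b * block_length
        dest_end = min(dest_start + block_length, genome_length)
        # Draw a source block uniformly; source blocks may overlap or repeat.
        src_start = rng.randrange(0, genome_length - block_length + 1)
        src_end = src_start + (dest_end - dest_start)
        shift = dest_start - src_start
        for start, end in ranges:
            # Assumption: keep only ranges lying entirely inside the block,
            # so spacing between neighbouring ranges is preserved.
            if start >= src_start and end <= src_end:
                boot.append((start + shift, end + shift))
    return sorted(boot)


# Two small clusters of ranges on a hypothetical 1000 bp genome.
ranges = [(10, 20), (25, 30), (32, 40), (500, 510), (515, 522)]
genome_length = 1000
boot = block_bootstrap(ranges, genome_length, 100, random.Random(1))
```

Because whole blocks are moved rather than individual ranges, ranges that were neighbours inside a block remain neighbours in the bootstrap sample, which is the property that shuffling start positions does not retain.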
We have implemented options for bootstrapping with respect to a segmented genome, to deal with highly heterogeneous range distributions. We will discuss the choice of segmentation and block length and their impact on hypothesis testing, in comparison to shuffling the start positions of ranges.

After generating a set of ranges representing the null hypothesis, we will demonstrate the use of plyranges as the engine for downstream overlap enrichment analysis or other analyses. Other downstream analyses made possible by combining nullranges with plyranges include computing correlations of sample data for all overlapping pairs of ranges, and optimizing an effect size threshold for differential analysis with the use of penalized splines. For the former we will also demonstrate complementary analysis with tidySE.
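
As a rough illustration of the overlap enrichment step, the sketch below compares an observed overlap count with the counts obtained from several null sets and summarizes the difference as a z-score. It is a plain-Python stand-in, not plyranges or nullranges code: the helper names `count_overlaps` and `enrichment_z`, the toy intervals, and the choice of a z-score as the summary statistic are assumptions made for this example.

```python
from statistics import mean, stdev


def count_overlaps(query, subject):
    """Count query ranges that overlap at least one subject range.

    Ranges are half-open (start, end) tuples on a single sequence.
    """
    return sum(
        any(q_start < s_end and s_start < q_end for s_start, s_end in subject)
        for q_start, q_end in query
    )


def enrichment_z(query, observed, null_sets):
    """Compare the observed overlap count with counts from null range sets."""
    obs = count_overlaps(query, observed)
    null_counts = [count_overlaps(query, null) for null in null_sets]
    sd = stdev(null_counts)  # sample standard deviation; needs >= 2 null sets
    if sd == 0:
        raise ValueError("null overlap counts have zero variance")
    return obs, null_counts, (obs - mean(null_counts)) / sd


# Toy data: three query ranges, one observed set, three null sets.
query = [(0, 10), (20, 30), (40, 50)]
observed = [(5, 8), (25, 27), (45, 60)]
null_sets = [
    [(12, 15), (25, 27), (70, 80)],
    [(5, 8), (33, 36), (45, 47)],
    [(100, 110), (12, 18), (31, 39)],
]
obs, null_counts, z = enrichment_z(query, observed, null_sets)
```

In practice many more null sets would be generated, and the null sets would come from matched or bootstrapped ranges rather than hand-written intervals.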