A Resource of Microbiome Benchmark Datasets With Biological Ground Truth
Author(s): Samuel David Gamboa-Tuz, Levi Waldron, Marcel Ramos
Affiliation(s): CUNY Graduate School of Public Health and Health Policy
Little consensus yet exists on the best methods for differential abundance (DA) analysis of microbiome data: classical statistical tests, methods adapted from the field of RNA-seq, and methods developed specifically for metagenomic data are all in use. Even the importance of accounting for the compositionality of microbiome profiles remains unclear. Independent benchmarking studies have demonstrated that current DA methods can have an inflated false discovery rate and show little concordance in the features they label as DA. A barrier to consensus on the appropriate selection of DA methods has been the lack of benchmark datasets that have typical ecological complexity but for which some ground truth is known. We therefore identified six datasets of differing ecological complexity and types of “biological ground truth” that have been characterized through experimental data independently of metagenomic sequencing. These benchmark datasets can be leveraged to evaluate the correctness of discoveries made by different DA methods on representative habitats, including the human oral cavity, gut, and urogenital tract. Datasets and their metadata are stored as CSV files in Zenodo to increase Findability, Accessibility, Interoperability, and Reusability (FAIR), and are imported into R/Bioconductor as (Tree)SummarizedExperiment objects by the MicrobiomeBenchmarkData R package (to be submitted to Bioconductor). Furthermore, we use these datasets to benchmark the correctness of discoveries made by commonly used DA methods. Preliminary results indicate that newly developed methods and the centered-log-ratio transformation do not help to identify the expected biological ground truth. We expect MicrobiomeBenchmarkData to facilitate the implementation and adoption of more appropriate DA methods in microbiome research.
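For readers unfamiliar with the centered-log-ratio (CLR) transformation named above, the following is a minimal sketch of the standard definition: each log-transformed count minus the mean of the log-transformed counts in the same sample. It is not the implementation used in this work. The function name `clr`, the pseudocount of 1 used to avoid taking the log of zero, and the example counts are all assumptions made for illustration.

```python
import math


def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one sample's taxon counts.

    The pseudocount is an assumption of this sketch; it keeps the log
    defined for taxa with zero observed counts.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]


# Hypothetical counts for four taxa in a single sample.
sample = [120, 30, 0, 850]
clr_values = clr(sample)

# CLR values are centered: they sum to zero within a sample.
print(round(abs(sum(clr_values)), 9))
```

With no pseudocount and strictly positive counts, the transform is unchanged when every count in a sample is multiplied by the same factor, which is why it is often used to address the compositionality of sequencing data.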