Abstract

Daniel B. Larremore 1,2

1 Department of Computer Science, University of Colorado Boulder, Boulder, Colorado, United States of America
2 BioFrontiers Institute, University of Colorado Boulder, Boulder, Colorado, United States of America

"Measuring the overlap between two populations is, in principle, straightforward. Upon fully sampling both populations, the number of shared objects — species, taxonomical units, or gene variants, depending on the context — can be directly counted. In practice, however, only a fraction of each population’s objects are likely to be sampled due to stochastic data collection or sequencing techniques. Although methods exist for quantifying population overlap under subsampled conditions, their bias is well documented and the uncertainty of their estimates cannot be quantified. Here we derive and validate a method to rigorously estimate the population overlap from incomplete samples when the total number of objects, species, or genes in each population is known, a special case of the more general β-diversity problem that is particularly relevant in the ecology and genomic epidemiology of malaria. By solving a Bayesian inference problem, this method takes into account the rates of subsampling and produces unbiased and Bayes-optimal estimates of overlap. In addition, it provides a natural framework for computing the uncertainty of its estimates, and can be used prospectively in study planning by quantifying the tradeoff between sampling effort and uncertainty."
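The inference problem described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes a simple generative model that the abstract does not spell out: populations A and B have known sizes `Na` and `Nb` and share `s` objects, each is sampled uniformly without replacement (`na` and `nb` objects respectively), `n_ab` shared objects are observed, and the prior on `s` is uniform. All function and variable names are hypothetical.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(k marked items among n draws without replacement from N items, K marked)."""
    if k < 0 or k > K or k > n or n - k > N - K:
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def overlap_likelihood(n_ab, s, Na, Nb, na, nb):
    """P(n_ab observed shared objects | true overlap s), under the assumed model.

    k counts the truly shared objects that land in A's sample; n_ab of those k
    must then also land in B's sample. k is unobserved, so it is summed out.
    """
    total = 0.0
    for k in range(n_ab, min(s, na) + 1):
        total += hypergeom_pmf(k, Na, s, na) * hypergeom_pmf(n_ab, Nb, k, nb)
    return total


def overlap_posterior(n_ab, Na, Nb, na, nb):
    """Posterior over s = 0..min(Na, Nb), assuming a uniform prior on s."""
    s_max = min(Na, Nb)
    weights = [overlap_likelihood(n_ab, s, Na, Nb, na, nb) for s in range(s_max + 1)]
    z = sum(weights)
    return [w / z for w in weights]


# Hypothetical example: 60 objects per population, 20 sampled from each,
# 5 shared objects observed.
posterior = overlap_posterior(n_ab=5, Na=60, Nb=60, na=20, nb=20)
posterior_mean = sum(s * p for s, p in enumerate(posterior))
```

Under these assumptions, the posterior mean is one possible point estimate of the overlap, and the spread of `posterior` gives a direct measure of uncertainty. Recomputing it for different values of `na` and `nb` shows how additional sampling effort narrows the posterior.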



Correspondence: Daniel B. Larremore. Email: daniel.larremore@colorado.edu