Modern data sets are often heterogeneous, containing non-linear or non-smooth relationships with explanatory or auxiliary variables that are structured in hierarchies and strata. Analysing such data requires complex statistical models, yet the usual modelling and distributional assumptions are unlikely to hold or, at the very least, are questionable. This has motivated the development and use of nonparametric and distribution-free statistical methods for inference. However, such methods are generally computationally expensive, rendering them infeasible for use in large data settings. For example, some algorithms for quantile regression are of order O(N^3), where N is the number of observations, which quickly becomes computationally infeasible for even modestly sized data. Thus, new methods are needed to facilitate the efficient use of nonparametric and distribution-free statistical methods in large data settings.
In this project, we propose the use of new experimental design methods to optimally subsample large data sets so that statistical inferences can be drawn efficiently and appropriately. This will build upon recent work undertaken by members of the Centre for Data Science, who developed the “Principles of experimental design for big data analysis”. That work, however, only scratched the surface of what experimental design methods can offer in the analysis of large data sets. We therefore propose a suite of new experimental design methods based on nonparametric and distribution-free approaches to data analysis. These developments will allow practitioners to conduct timely, informative, and efficient analyses in large data settings while relaxing some key statistical assumptions that are unlikely to hold in practice.
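To make the idea of design-based subsampling concrete, the following is a minimal sketch only, not the methods proposed in this project. It assumes a simple linear model and uses a greedy D-optimality criterion as a stand-in for the nonparametric and distribution-free criteria to be developed. The function name greedy_d_optimal_subsample and the ridge parameter are hypothetical and chosen for illustration.

```python
import numpy as np

def greedy_d_optimal_subsample(X, n_sub, ridge=1e-3):
    """Greedily pick n_sub rows of X that increase det(X_s^T X_s) the most.

    Illustrative assumption: a linear model with design matrix X (N x p).
    A small ridge term keeps the information matrix invertible at the start.
    """
    N, p = X.shape
    M_inv = np.eye(p) / ridge            # inverse of the initial information matrix
    available = np.ones(N, dtype=bool)   # rows not yet selected
    chosen = []
    for _ in range(n_sub):
        # By the matrix determinant lemma, adding row x multiplies det(M)
        # by (1 + x^T M^{-1} x), so we pick the row maximising x^T M^{-1} x.
        gain = np.einsum("ij,jk,ik->i", X, M_inv, X)
        gain[~available] = -np.inf
        i = int(np.argmax(gain))
        chosen.append(i)
        available[i] = False
        # Sherman-Morrison rank-one update of the inverse information matrix.
        v = M_inv @ X[i]
        M_inv -= np.outer(v, v) / (1.0 + X[i] @ v)
    return np.array(chosen)
```

Each step in this sketch rescans all N rows, so the cost is roughly proportional to N times the subsample size for a fixed number of covariates. Any subsequent analysis is then run on the selected rows only.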
One advantage of the proposed methods is the ability to assess how well the collected data match a supposed designed experiment. This is useful because large data sets are often assumed to be representative, unbiased and fit-for-purpose, yet this assumption generally goes untested.
Based on matching experimental designs with the collected data, we will provide new methods to assess the representativeness, potential bias and usefulness of large data sets for answering specific research questions. This will allow researchers to understand the scope of their study, determine whether they are extrapolating or interpolating, estimate and potentially correct for biases in their analysis, and ultimately assess how reliable the collected data are for their intended purposes.
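As one illustration of how interpolation might be distinguished from extrapolation, the sketch below uses a standard leverage-based check for hidden extrapolation under an assumed linear model. It is not one of the new methods this project will develop, and the function name extrapolation_flags is hypothetical. A query point is flagged when its leverage exceeds the largest leverage among the observed data.

```python
import numpy as np

def extrapolation_flags(X_data, X_query):
    """Flag query points whose leverage exceeds the largest leverage in the data.

    Illustrative assumption: a linear model whose design matrix rows are the
    rows of X_data; X_query holds the covariate rows at which we wish to predict.
    """
    # Pseudo-inverse guards against a rank-deficient information matrix.
    M_inv = np.linalg.pinv(X_data.T @ X_data)
    # Leverage x^T (X^T X)^{-1} x for each observed row and each query row.
    h_data = np.einsum("ij,jk,ik->i", X_data, M_inv, X_data)
    h_query = np.einsum("ij,jk,ik->i", X_query, M_inv, X_query)
    return h_query > h_data.max()
```

A point that is not flagged lies within the ellipsoid spanned by the observed covariates. This is a weaker statement than lying inside their convex hull, so the check should be read as a screening tool only.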