Benchmarking Missing Data Imputation Methods in Socioeconomic Surveys


17 Apr 2026

Reproducible Research

Missing data imputation is a core challenge in socioeconomic surveys, where data are often longitudinal, hierarchical, high-dimensional, not independent and identically distributed, and missing under complex mechanisms. Socioeconomic datasets like the Consumer Pyramids Household Survey (CPHS), the largest continuous household survey in India since 2014 with coverage of 174,000 households, highlight the importance of robust imputation, which can reduce survey costs, preserve statistical power, and enable timely policy analysis. This paper systematically evaluates imputation methods under three missingness mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR), across five missingness ratios ranging from 10% to 50%. We evaluate imputation performance on both continuous and categorical variables, assess the impact on downstream tasks, and compare the computational efficiency of each method. Our results indicate that classical machine learning methods such as MissForest and HyperImpute remain strong baselines with favorable trade-offs between accuracy and efficiency, while deep learning methods perform better under complex missingness patterns and higher missingness ratios but face scalability challenges. We ran experiments on CPHS and multiple synthetic survey datasets and found consistent patterns across them. Our framework aims to provide a reliable benchmark for structured socioeconomic surveys and addresses the critical gap in reproducible, domain-specific evaluation of imputation methods. The open-source code is provided in Appendix A.2.
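To make the three missingness mechanisms concrete, the following is a minimal sketch of how masks for MCAR, MAR, and MNAR could be simulated on a toy two-column array. It is not the authors' benchmark code. The function name `make_mask`, the noise scale, the choice of column 0 as the variable that goes missing, and the evenly spaced ratios (10%, 20%, 30%, 40%, 50%) are all assumptions made for illustration; the abstract states only that five ratios between 10% and 50% were used.

```python
import numpy as np

# Assumed grid: the abstract gives only the range (10% to 50%) and the count (five).
RATIOS = [0.1, 0.2, 0.3, 0.4, 0.5]


def make_mask(X, mechanism, ratio, rng):
    """Return a boolean mask (True = missing) for column 0 of X.

    Hypothetical helper for illustration only.
    """
    n = X.shape[0]
    k = int(round(ratio * n))  # number of entries to mark as missing
    if mechanism == "MCAR":
        # Missingness is independent of every value in the data.
        scores = rng.random(n)
    elif mechanism == "MAR":
        # Missingness depends on a fully observed column (column 1).
        scores = X[:, 1] + rng.normal(scale=0.5, size=n)
    elif mechanism == "MNAR":
        # Missingness depends on the value that goes missing (column 0).
        scores = X[:, 0] + rng.normal(scale=0.5, size=n)
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    mask = np.zeros(n, dtype=bool)
    if k > 0:
        # Rows with the k highest scores lose their column-0 value.
        mask[np.argsort(scores)[-k:]] = True
    return mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2))  # toy data, not survey data
    for mechanism in ("MCAR", "MAR", "MNAR"):
        for ratio in RATIOS:
            mask = make_mask(X, mechanism, ratio, rng)
            X_missing = X.copy()
            X_missing[mask, 0] = np.nan
            print(mechanism, ratio, float(np.isnan(X_missing[:, 0]).mean()))
```

Under MNAR in this sketch, the removed values are systematically the larger ones, so the observed values alone give a biased picture of the column. Settings of this kind are where simple imputers tend to struggle.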

Citation:

Benchmarking missing data imputation methods in socioeconomic surveys, Siyi Sun, David Antony Selby, Yunchuan Huang, Ayush Patnaik, Sebastian Vollmer, Seth Flaxman, Anisoara Calinescu, Transactions on Machine Learning Research, February 2026

AUTHORS