BClean+: A Bayesian Data Cleaning System With Automated Prior Generation

2026-04-15

Ziyan Han, Jing Zhu, Jinbin Huang, Sifan Huang, Yaoshu Wang, Rui Mao*, and Jianbin Qin*

Published at TKDE 2026


Abstract:

Probabilistic approaches, particularly Bayesian methods, are a cornerstone of data cleaning, yet they often depend on complex prior distributions that require costly and labor-intensive expert input. Our prior work, BClean, alleviated this burden by introducing automatic Bayesian network (BN) construction and lightweight user constraints (UCs), but it still fundamentally relies on manually provided prior knowledge. In this paper, we present BClean+, an enhanced Bayesian data cleaning system that extends BClean with a novel framework for automated prior generation. BClean+ leverages Large Language Models (LLMs) to identify attribute semantics and automatically synthesizes format patterns as UCs, while continuously maintaining a reusable template library. It also enhances BN construction through hierarchical structure discovery, improving interpretability and enabling more effective refinement for accurate inference. By integrating the automatically generated UCs into its Bayesian inference framework, BClean+ achieves more robust and accurate cleaning. Moreover, the framework generalizes to the synthesis of probabilistic programming language (PPL) code for systems such as PClean, thereby addressing a critical usability challenge in PPL-based data cleaning. Extensive experiments on real-world datasets demonstrate that BClean+ achieves an average F1-score of 0.89 (up to 0.98), outperforming state-of-the-art methods by 0.42 on average (up to 0.57). It also reduces user configuration time from hours to under five minutes and delivers an average 113.28× speedup in total runtime over BClean and other baselines.
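
To make the idea of a format pattern acting as a user constraint more concrete, here is a minimal sketch. It is not the BClean+ implementation. It assumes that a UC can be written as one regular expression per attribute, and that candidate repair values violating the pattern are discarded before scoring. The names `UC_PATTERNS`, `satisfies_uc`, and `repair` are hypothetical, the sample rows are made up for illustration, and the score is a naive co-occurrence frequency standing in for the Bayesian network inference described in the abstract.

```python
import re
from collections import Counter

# Hypothetical format patterns acting as user constraints (UCs), one per attribute.
# In the abstract's setting these would be synthesized automatically; here they are hard-coded.
UC_PATTERNS = {
    "zip": re.compile(r"\d{5}"),
}


def satisfies_uc(attr, value):
    """Return True if the value fully matches the attribute's pattern, or no pattern exists."""
    pattern = UC_PATTERNS.get(attr)
    return pattern is None or pattern.fullmatch(value) is not None


def repair(rows, target, evidence_attr, evidence_value):
    """Pick the most frequent UC-satisfying target value among rows sharing the evidence value.

    Returns (value, relative_frequency), or None when no candidate survives the constraint.
    This frequency estimate is an assumption of the sketch, not the paper's inference method.
    """
    counts = Counter(
        row[target]
        for row in rows
        if row[evidence_attr] == evidence_value and satisfies_uc(target, row[target])
    )
    if not counts:
        return None
    total = sum(counts.values())
    value, count = counts.most_common(1)[0]
    return value, count / total


# Illustrative data only: the third row carries a malformed zip code.
rows = [
    {"city": "Boston", "zip": "02108"},
    {"city": "Boston", "zip": "02108"},
    {"city": "Boston", "zip": "2108x"},
    {"city": "Austin", "zip": "73301"},
]

result = repair(rows, "zip", "city", "Boston")
print(result)
```

In this sketch the malformed value `2108x` never becomes a repair candidate because it fails the pattern, which is the role the abstract assigns to UCs: narrowing the candidate space before probabilistic inference chooses among the remaining values.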

