BClean+: A Bayesian Data Cleaning System with Automated Prior Generation

2026-04-15

Ziyan Han, Jing Zhu, Jingbin Huang, Rui Mao*, and Jianbin Qin*

Published at ICDE Demo 2026


Abstract:

We demonstrate BClean+, a Bayesian data cleaning system that unifies automated prior generation, data relationship modeling, and probabilistic inference in an end-to-end cleaning framework. BClean+ (1) uses large language models (LLMs) for zero-shot semantic labeling and then automatically synthesizes format patterns that serve as user constraints, (2) constructs attribute dependency models using Bayesian networks with community detection, (3) performs maximum a posteriori (MAP) probabilistic inference with a compensative scoring model, and (4) optionally synthesizes probabilistic programming language (PPL) code for reuse in PPL-based systems such as PClean. The demonstration highlights (a) an interactive interface for optional refinement of user constraints, Bayesian networks, and PPL code, (b) the system's automation and interpretability, and (c) its effectiveness and efficiency on real-world datasets. Extensive experiments show that BClean+ achieves an average F1-score of 0.934 and a 98.6× runtime speedup compared to competitive cleaning systems, while reducing user configuration time from hours to minutes.
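
To make the idea of MAP-based repair in step (3) concrete, the following is a minimal sketch and not BClean+'s actual implementation. It assumes a hypothetical two-attribute table in which `zip` is the only neighbor of `city` in the dependency model, and it estimates probabilities from raw counts with Laplace smoothing. The function name `map_repair`, its parameters, and the toy data are all illustrative assumptions; the compensative scoring model mentioned in the abstract is not modeled here.

```python
from collections import Counter

def map_repair(rows, target, parent, observed_parent, alpha=1.0):
    """Pick the value of `target` that maximizes P(value) * P(observed_parent | value).

    Hypothetical helper for illustration only: probabilities are estimated
    from counts in `rows`, with Laplace smoothing controlled by `alpha`.
    """
    prior = Counter(r[target] for r in rows)                  # counts of each candidate value
    joint = Counter((r[target], r[parent]) for r in rows)     # co-occurrence counts
    parent_values = {r[parent] for r in rows}                 # domain of the parent attribute
    n = len(rows)

    def posterior_score(value):
        p_value = prior[value] / n
        p_parent_given_value = (joint[(value, observed_parent)] + alpha) / (
            prior[value] + alpha * len(parent_values)
        )
        # Proportional to the posterior P(value | observed_parent).
        return p_value * p_parent_given_value

    return max(prior, key=posterior_score)

# Toy data (assumed): one row contains the typo "New Yrok".
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New Yrok"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chicago"},
]

repaired = map_repair(rows, target="city", parent="zip", observed_parent="10001")
print(repaired)  # New York
```

In this sketch the misspelled value is outvoted because the correct spelling has both a higher prior and stronger co-occurrence with the observed `zip`. A full system would score candidates against every dependent attribute in the Bayesian network rather than a single parent.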

