Automatic String Data Validation with Pattern Discovery

2026-04-15

Ziyan Han, Xinwei Lin, Peng Di, Chuan Xiao, Makoto Onizuka, Jiuzhang Liu, Rui Mao*, and Jianbin Qin*

Published at DASFAA 2026


Abstract:

Periodic data insertions in enterprise data pipelines may propagate data quality issues downstream, potentially disrupting critical services. Although on-call engineers can investigate and fix such issues, identifying their root causes is often time-consuming. This paper presents AutoPattern, a self-validating data management system that automatically discovers patterns to validate semi-structured string data in enterprise pipelines. AutoPattern extracts patterns from historical data in a top-down manner, first inferring high-level structural skeletons to capture recursive and vertically aligned relationships, then performing fine-grained semantic refinement to balance generalization and specificity. To address cold start and rapid data growth, we further introduce a data augmentation module and an incremental pattern update mechanism. Extensive experiments on public, synthetic, and industrial datasets demonstrate the effectiveness and efficiency of AutoPattern, with an average precision of 0.91 and a recall of 0.89, outperforming competitive baselines. A case study conducted on the industrial platform of Ant Group Inc., which hosts thousands of applications, further confirms its practicality in real-world production pipelines, where AutoPattern effectively captures meaningful patterns and assists engineers in rapid error localization.
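
To make the top-down idea concrete, the following is a minimal illustrative sketch in Python, not AutoPattern's actual algorithm. It assumes a much simpler setting than the paper: strings are split into runs of letters, runs of digits, and single punctuation characters; the sequence of token classes stands in for the "structural skeleton"; and "refinement" only decides, per aligned position, whether to keep a constant literal or generalize to a character class with a length range. All function names (tokenize, skeleton, discover_patterns, validate) are hypothetical, and recursive structure, data augmentation, and incremental updates are not modeled.

```python
import re
import string
from collections import defaultdict

# Split a string into letter runs, digit runs, and single other characters.
TOKEN = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9]")
CLASS_REGEX = {"D": "[0-9]", "A": "[A-Za-z]"}


def tokenize(value):
    return TOKEN.findall(value)


def token_class(token):
    # Digit runs -> "D", letter runs -> "A", punctuation stays a literal.
    if token[0] in string.digits:
        return "D"
    if token[0] in string.ascii_letters:
        return "A"
    return token


def skeleton(value):
    # Coarse structure of a string: the sequence of its token classes.
    return tuple(token_class(t) for t in tokenize(value))


def discover_patterns(history):
    # Step 1 (coarse): group historical values by skeleton.
    groups = defaultdict(list)
    for value in history:
        groups[skeleton(value)].append(tokenize(value))

    # Step 2 (fine): refine each aligned token position within a group.
    patterns = []
    for skel, rows in groups.items():
        parts = []
        for cls, column in zip(skel, zip(*rows)):
            if cls not in CLASS_REGEX or len(set(column)) == 1:
                # Punctuation, or a value that never varies: keep it literal.
                parts.append(re.escape(column[0]))
            else:
                # Varying values: generalize to the class plus a length range.
                lengths = [len(t) for t in column]
                parts.append(
                    f"{CLASS_REGEX[cls]}{{{min(lengths)},{max(lengths)}}}"
                )
        patterns.append(re.compile("".join(parts)))
    return patterns


def validate(value, patterns):
    # A new value passes if any discovered pattern matches it entirely.
    return any(p.fullmatch(value) for p in patterns)


history = ["2026-04-15", "2025-12-01", "2024-01-30", "ERR-001", "ERR-042", "ERR-7"]
patterns = discover_patterns(history)
```

With this toy history, validate("2027-07-09", patterns) and validate("ERR-999", patterns) pass, while validate("2027/07/09", patterns) and validate("WARN-001", patterns) are flagged. Note that a group seen only once keeps every token literal in this sketch, which is one simple way to see why a cold start can over-specialize patterns.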

