ITEA is the Eureka Cluster on software innovation

Automated Data Cleaning for Tabular Data in ML Pipelines

Project
20219 IML4E
Type
New system
Description

In this achievement, Software AG contributed three patents submitted to the US patent office during the IML4E project. The first patent deals with automated error detection in tabular data using meta-learning and semi-supervised learning. The second patent introduces a method that improves data quality by first locating erroneous data and then using data augmentation to increase the proportion of clean data instances. The third patent proposes a reinforcement-learning-based method that selects the best data cleaning tools based on the performance of the target ML models.

Contact
Mohamed Abdelaal
Email
mohamed.abdelaal@softwareag.com
Research area(s)
Data Quality, Machine Learning, Data Cleaning
Technical features

Through the IML4E project, we developed several innovative tools for automating tabular data preparation. These tools address the critical need to improve data quality for more accurate machine learning models. One such tool, SAGED, is a meta-learning-based error detection system that leverages historical data to identify erroneous instances, reducing manual error correction efforts. Another tool, AutoCure, focuses on data curation by augmenting the proportion of clean data instances, effectively improving the signal-to-noise ratio within datasets. Lastly, ReClean, a reinforcement learning-based tool, optimizes the data cleaning process by automatically selecting the most appropriate cleaning tools based on their impact on downstream machine learning model performance.
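As a rough illustration of the AutoCure idea mentioned above, the following Python sketch flags suspicious cells with a small ensemble of detectors and then adds resampled clean rows so that the share of clean instances grows. It is a minimal sketch under stated assumptions, not the AutoCure implementation: the three detectors, the fixed voting threshold `min_votes`, the bootstrap resampling used as a stand-in for the tool's augmentation module, and the sample `abv` values are all hypothetical.

```python
import random
import statistics


def to_float(value):
    """Return value as a float, or None if it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def detect_missing(column):
    """Flag None or blank cells."""
    return [v is None or str(v).strip() == "" for v in column]


def detect_non_numeric(column):
    """Flag cells that cannot be parsed as numbers."""
    return [to_float(v) is None for v in column]


def detect_outlier(column, threshold=3.5):
    """Flag numeric cells with a large modified z-score (median and MAD)."""
    parsed = [to_float(v) for v in column]
    numbers = [x for x in parsed if x is not None]
    if not numbers:
        return [False] * len(column)
    median = statistics.median(numbers)
    mad = statistics.median(abs(x - median) for x in numbers)
    if mad == 0:
        return [False] * len(column)
    return [x is not None and 0.6745 * abs(x - median) / mad > threshold
            for x in parsed]


# Hypothetical detector ensemble; the real tool's detectors are not shown here.
DETECTORS = [detect_missing, detect_non_numeric, detect_outlier]


def ensemble_flags(column, detectors=DETECTORS, min_votes=1):
    """Mark a cell as erroneous when at least min_votes detectors flag it."""
    votes = [sum(flags) for flags in zip(*(d(column) for d in detectors))]
    return [v >= min_votes for v in votes]


def augment_clean(rows, dirty_flags, n_new, seed=0):
    """Append n_new rows resampled from the rows not flagged as erroneous."""
    rng = random.Random(seed)
    clean = [r for r, dirty in zip(rows, dirty_flags) if not dirty]
    if not clean:
        return list(rows)
    return list(rows) + [dict(rng.choice(clean)) for _ in range(n_new)]


# Made-up sample column: one blank, one non-numeric and one outlier value.
rows = [{"abv": v} for v in
        ["0.05", "0.06", "0.055", "", "0.07", "abc", "5.0",
         "0.045", "0.065", "0.05"]]
flags = ensemble_flags([r["abv"] for r in rows])
dirty_indices = [i for i, flagged in enumerate(flags) if flagged]
augmented = augment_clean(rows, flags, n_new=20)

clean_before = flags.count(False) / len(rows)
clean_after = (flags.count(False) + 20) / len(augmented)
print(dirty_indices, clean_before, clean_after)
```

In this sketch the flagged rows stay in the data set and only the proportion of clean rows changes, which mirrors the description of raising the share of clean instances rather than repairing each error.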

SAGED identifies errors through a two-phase approach. The first phase trains binary classifiers on historical datasets to distinguish between erroneous and clean instances for each column. The second phase selects relevant pre-trained models based on the input dataset's characteristics. These models generate feature vectors, capturing historical knowledge, which are then used to train meta-classifiers for accurate error detection in the input data.

AutoCure, on the other hand, utilizes an adaptive ensemble-based error detection module to isolate erroneous instances and a clean data augmentation module to increase the density of clean instances, improving overall data quality.

Meanwhile, ReClean employs reinforcement learning agents to make sequential decisions, selecting optimal data repair operations based on their impact on machine learning model convergence. This approach, combining experience replay with the REINFORCE algorithm, allows ReClean to learn effective repair strategies tailored to specific error patterns, leading to improved model accuracy and reliability.
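The reward-driven selection described for ReClean can be illustrated with a small policy-gradient example. The Python sketch below applies a REINFORCE-style update with a running-average baseline to a softmax policy over three cleaning actions. It is a simplified illustration under stated assumptions, not ReClean itself: the action names, the simulated scores in `SIMULATED_SCORE`, and the hyperparameters are hypothetical, the problem is reduced to a single stateless decision, and experience replay is omitted.

```python
import math
import random

# Hypothetical cleaning actions with made-up mean downstream-model scores.
# In a real pipeline the reward would come from training and validating
# the target ML model on the data produced by the chosen action.
SIMULATED_SCORE = {"drop_rows": 0.70, "mean_impute": 0.78, "knn_impute": 0.85}
ACTIONS = list(SIMULATED_SCORE)


def softmax(preferences):
    """Turn action preferences into a probability distribution."""
    top = max(preferences)
    exps = [math.exp(p - top) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(episodes=2000, learning_rate=0.5, noise=0.02, seed=0):
    """Learn action probabilities with a REINFORCE-style update."""
    rng = random.Random(seed)
    preferences = [0.0] * len(ACTIONS)
    baseline = 0.0  # running average of rewards, used to reduce variance
    for step in range(1, episodes + 1):
        probs = softmax(preferences)
        chosen = rng.choices(range(len(ACTIONS)), weights=probs)[0]
        # Simulated noisy reward standing in for a downstream model score.
        reward = SIMULATED_SCORE[ACTIONS[chosen]] + rng.gauss(0, noise)
        advantage = reward - baseline
        for k in range(len(ACTIONS)):
            # Gradient of log pi(chosen) with respect to preference k.
            grad_log_pi = (1.0 if k == chosen else 0.0) - probs[k]
            preferences[k] += learning_rate * advantage * grad_log_pi
        baseline += (reward - baseline) / step
    return softmax(preferences)


final_probs = train_policy()
best_action = ACTIONS[final_probs.index(max(final_probs))]
print(best_action, [round(p, 3) for p in final_probs])
```

Because the update scales each preference by how much the observed reward exceeds the baseline, actions that lead to better downstream scores gradually receive more probability mass. A sequential version would additionally condition the policy on the current state of the data set.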

Integration constraints

The components of the data cleaning tool have been developed in Python 3 and tested on Linux machines. They have been evaluated on the craft beers data set, the individual income tax data set, and the smart factory data set.

Targeted customer(s)

The target customers for the IML4E tools are businesses and organizations that: (1) rely heavily on data analysis and machine learning, including industries like finance, healthcare, marketing, and technology companies developing AI-driven solutions; (2) deal with large volumes of tabular data, since the tools are specifically designed for tabular data and suit organizations managing extensive datasets through databases or spreadsheets; (3) require high-quality data for accurate insights, as in industries where data accuracy is critical, such as healthcare for diagnosis or finance for risk assessment; (4) seek to automate data preparation processes, reducing manual effort and improving efficiency in their data cleaning and preparation pipelines; and (5) want to optimize machine learning model performance, since cleaner data directly contributes to more accurate and reliable machine learning models.

Conditions for reuse

Open source (CC-BY 4.0)

Confidentiality
Public
Publication date
30-05-2024
Involved partners
Software AG (DEU)