ITEA is the Eureka Cluster on software innovation

Data Quality Monitoring Dashboard

Project
20219 IML4E
Type
New service
Description

This tool is an interactive data quality dashboard with a modular design that automates and streamlines data quality management. It automates data profiling, validation, error detection, and correction, which reduces manual effort and improves accuracy. The dashboard integrates with tools such as MLflow and Delta Lake to track data quality processes, experiments, and data versions. It also generates DataSheets that document data quality management activities for transparency and reproducibility.

Contact
Mohamed Abdelaal
Email
mohamed.abdelaal@softwareag.com
Research area(s)
Data Quality, Automated ML Generation
Technical features

The data monitoring tool is an interactive data quality dashboard with a modular design that streamlines and automates several stages of data quality management: data profiling, validation, error detection, and correction. A key contribution is automated data profiling. Instead of requiring users to define profiling rules manually, the tool extracts these rules automatically, saving time and reducing the potential for human error. The extraction takes into account both the statistical properties and the domain-specific characteristics of the data.
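The idea of extracting profiling rules from data instead of writing them by hand can be illustrated with a minimal sketch. It is not the tool's implementation: the function names `infer_rules` and `check_row`, the rule format, and the sample columns are all hypothetical, and only simple statistical rules (value ranges and allowed categories) are inferred, without any domain-specific knowledge.

```python
def infer_rules(rows, max_categories=10):
    """Derive one simple validation rule per column from sample rows.

    Hypothetical helper: numeric columns get a min/max range rule,
    low-cardinality columns get an allowed-values rule.
    """
    rules = {}
    if not rows:
        return rules
    for col in rows[0].keys():
        values = [r[col] for r in rows if r.get(col) is not None]
        if not values:
            continue  # nothing to learn from an empty column
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )
        if numeric:
            rules[col] = {"type": "range", "min": min(values), "max": max(values)}
        elif len(set(values)) <= max_categories:
            rules[col] = {"type": "allowed", "values": sorted(set(values))}
        else:
            rules[col] = {"type": "not_null"}
    return rules


def check_row(row, rules):
    """Return the names of columns in `row` that break an inferred rule."""
    broken = []
    for col, rule in rules.items():
        value = row.get(col)
        if value is None:
            broken.append(col)
        elif rule["type"] == "range":
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number or not (rule["min"] <= value <= rule["max"]):
                broken.append(col)
        elif rule["type"] == "allowed" and value not in rule["values"]:
            broken.append(col)
    return broken


# Illustrative sample data (assumed, not from the project).
sample = [
    {"age": 34, "country": "DE"},
    {"age": 51, "country": "FI"},
    {"age": 28, "country": "DE"},
]
rules = infer_rules(sample)
print(rules)
print(check_row({"age": 140, "country": "XX"}, rules))
```

In a workflow like the one described, the inferred rules would be shown to the data owner for validation before being enforced.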

The dashboard also includes an automated data validation and error detection module, capable of handling both quantitative and qualitative error types. This module utilizes ML techniques and statistical methods to identify inconsistencies, outliers, and other data quality issues, providing a comprehensive and accurate assessment of data quality.

Another innovative feature of the dashboard is the automated data correction component. This module uses advanced algorithms to propose and apply corrections to identified data errors, reducing the manual intervention traditionally required in data cleaning processes.

The tool has been designed with integration capabilities for common ML tracking tools, such as MLflow. This allows for seamless tracking of data quality experiments, models, and results, providing a unified view of both data quality management and ML processes. The system also includes an innovative integration with Delta Lake, enabling tracking of different data versions. This feature brings robustness to the data management process, allowing the tracking of changes over time, and facilitating rollback to previous data versions if required.

The dashboard also supports iterative cleaning, running several cleaning iterations with the aim of optimizing the performance of downstream machine learning models. This process is designed to continually improve data quality and model performance over time.

In this system, the role of the data owner is streamlined and focused, limited to validating the generated rules and/or corrections, if necessary, and labeling data samples to train ML models used for data validation or correction. This approach maximizes the value of domain expert input while minimizing the technical burden on them.

In summary, this tool provides a comprehensive, automated, and user-friendly solution for data quality management, offering significant improvements in efficiency, effectiveness, and transparency.
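The detect, correct, and repeat cycle described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the tool's algorithm: it swaps in a plain statistical detector (Tukey's IQR fences) and median replacement, the function names are hypothetical, and the loop stops when no outliers remain, whereas the tool is described as aiming at downstream model performance.

```python
from statistics import median, quantiles


def find_outliers(values, k=1.5):
    """Return indices of values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


def propose_corrections(values, outlier_idx):
    """Propose replacing each flagged value with the median of the unflagged ones."""
    flagged = set(outlier_idx)
    clean = [v for i, v in enumerate(values) if i not in flagged]
    fill = median(clean)
    return {i: fill for i in outlier_idx}


def clean_iteratively(values, max_iterations=5):
    """Repeat detection and correction until nothing is flagged or the budget runs out."""
    values = list(values)  # work on a copy, leave the input untouched
    for _ in range(max_iterations):
        idx = find_outliers(values)
        if not idx:
            break
        for i, new_value in propose_corrections(values, idx).items():
            values[i] = new_value
    return values


# Illustrative sensor readings with two obvious errors (assumed data).
readings = [20.1, 19.8, 20.4, 20.0, 95.0, 19.9, 20.2, -40.0]
print(find_outliers(readings))
cleaned = clean_iteratively(readings)
print(cleaned)
```

Keeping the proposal step separate from the step that applies it leaves room for a data owner to review corrections before they are written back.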

Integration constraints

The dashboard is implemented with the following tools: Grafana, EvidentlyAI, Prometheus, and MLflow. It runs on Linux machines.
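The source does not describe how these tools are wired together. One common pattern is for a service to expose data quality metrics in the Prometheus text exposition format, which Prometheus scrapes and Grafana then charts. The sketch below renders such a payload by hand to show the format; the metric names, help texts, and the `dataset` label are assumptions made for illustration, and a real service would more likely use the official `prometheus_client` package.

```python
def render_gauges(metrics, labels):
    """Render gauge values in the Prometheus text exposition format.

    `metrics` maps a metric name to a (help_text, value) pair.
    `labels` is applied to every sample line.
    """
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    lines = []
    for name, (help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical data quality metrics for one dataset.
payload = render_gauges(
    {
        "dq_missing_ratio": ("Share of missing cells in the dataset", 0.02),
        "dq_rule_violations": ("Rows violating at least one rule", 17),
    },
    labels={"dataset": "orders"},
)
print(payload)
```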

Targeted customer(s)

Data scientists and machine learning engineers who require a comprehensive and automated solution for managing and improving the quality of their data can use the dashboard.

Conditions for reuse

The tool will soon be released as open source under the CC-BY 4.0 license.

Confidentiality
Public
Publication date
31-01-2025
Involved partners
Software AG (DEU)