Sogeti’s Data Quality Wrapper - Automating your data pre-processing with Streamlit
Sogeti NL has a large data science team that is always looking for methods to ensure transparency, ethics and quality in its AI development process. We are also involved in ITEA IVVES, a project that focuses on testing AI models in various development phases. As part of this project, we developed the Data Quality Wrapper (DQW), an app for automated EDA, preprocessing and data quality report generation. Its goal is to automate the preprocessing of data, but also to educate aspiring and experienced data scientists about the different methods that can be used to improve data quality.
While looking for a way to build an app around this solution, we found Streamlit, a framework for quickly building apps for ML projects and experiments. I’ve already written about how easy it is to develop apps with it.
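To give a feel for how little code a Streamlit app needs, here is a minimal sketch (illustrative only, not the DQW source) that lets a user upload a CSV file and immediately shows a preview and summary statistics:

```python
# minimal_app.py - minimal Streamlit sketch, not the actual DQW code
import pandas as pd
import streamlit as st

st.title("Data Quality Wrapper - demo")

# Let the user upload a structured data file
uploaded_file = st.file_uploader("Upload a CSV file", type="csv")

if uploaded_file is not None:
    df = pd.read_csv(uploaded_file)
    st.write("Preview of the data:")
    st.dataframe(df.head())
    st.write("Basic descriptive statistics:")
    st.write(df.describe())
```

Running `streamlit run minimal_app.py` serves this as an interactive web app in the browser.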
In this blog post, we will go through the purpose of the app and its sections, the Streamlit components and the packages used to develop the app. We will also point to the scripts where the code is located.
The purpose of the app
Sogeti’s DQW serves as an accelerator for another product we are developing, the Quality AI Framework, which is used to test AI models in all phases of the AI development cycle. The framework provides a practical and standardized way of working that outputs trustworthy AI. Sogeti’s DQW is used in the Data Understanding and Data Preparation phases of this framework, as an accelerator that ensures the data going into a given ML model is of suitable quality and representative.
The best thing about the app is that it can be applied to more than one data structure, including:
- Structured data. Data in a well-defined format. Used in various ML applications.
- Unstructured data. This includes images, used in computer vision algorithms such as object detection and classification; text, used in NLP models for tasks such as classification or sentiment analysis; and audio, used in audio signal processing algorithms such as music genre recognition and automatic speech recognition.
- Synthetic data. Evaluation is a critical step of the synthetic data generation pipeline: validating the synthetic training set ensures that model performance will not be negatively impacted (a simple validation sketch follows this list).
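To make that last point concrete, one simple way to validate synthetic data is to compare the distribution of each numerical feature in the real and synthetic sets, for example with a two-sample Kolmogorov-Smirnov test. This is an illustrative sketch only, not necessarily the method used in the DQW:

```python
# Compare real vs. synthetic feature distributions with a two-sample KS test.
# Illustrative sketch; the significance threshold is an assumption.
import pandas as pd
from scipy.stats import ks_2samp

def compare_distributions(real: pd.DataFrame, synthetic: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """Return the KS statistic and p-value for every shared numerical column."""
    results = {}
    numeric_cols = real.select_dtypes("number").columns.intersection(synthetic.columns)
    for col in numeric_cols:
        test = ks_2samp(real[col].dropna(), synthetic[col].dropna())
        results[col] = {
            "ks_statistic": test.statistic,
            "p_value": test.pvalue,
            "distributions_similar": test.pvalue > alpha,
        }
    return pd.DataFrame(results).T
```

A high p-value means the test found no evidence that the real and synthetic distributions differ for that feature.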
These data formats define the app sections, which you can toggle through in the main selectbox. Each of the sections has multiple subsections, which we will go through in the next few paragraphs.
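In Streamlit, that kind of section toggle comes down to a single selectbox. Here is a sketch of the idea, where the section names are assumptions based on the list above:

```python
import streamlit as st

# Main selectbox that switches between the app sections
section = st.selectbox(
    "Select the type of data you want to analyse",
    ("Structured data", "Unstructured data: text", "Unstructured data: images",
     "Unstructured data: audio", "Synthetic data"),
)

if section == "Structured data":
    st.header("Structured data EDA and preprocessing")
    # ... structured data subsections go here
elif section == "Synthetic data":
    st.header("Synthetic data evaluation")
    # ... synthetic data subsections go here
```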
The packages used to enable the EDA (description, visualisation) and preprocessing (selection) of these data formats are covered in the full blog post. Please note that these are packages we recommend using, not a definitive guide.
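As one example of what such a package can do, a profiling library like ydata-profiling (formerly pandas-profiling) can generate a complete EDA report for structured data and embed it in the app. This sketch assumes that library is installed; it is not necessarily the one the DQW uses:

```python
import pandas as pd
import streamlit.components.v1 as components
from ydata_profiling import ProfileReport

df = pd.read_csv("data.csv")  # placeholder path

# Generate an HTML profiling report and embed it in the Streamlit app
report = ProfileReport(df, title="DQW EDA report", minimal=True)
components.html(report.to_html(), height=800, scrolling=True)
```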
To read the entire blog, visit https://medium.com/sogetiblogsnl or download the PDF.
Related projects
IVVES: Industrial-grade Verification and Validation of Evolving Systems