Data is at the centre of any ML (Machine Learning) model. But it exists in different kinds and formats including structured and unstructured forms in diverse locations with varying security and privacy requirements. Today, the bulk of the time and effort goes into collecting and collating large volumes of data in data science/ ML projects. Backed by the assumption that data is well-behaved, the parameters are optimized, models are fiddled with and rebuilt to make the model more robust. This model-centric approach to ML projects does not yield the best results. Since these models are built on low quality data, their accuracy and performance are directly impacted, leading to higher failure rates.
Research suggests the failure rate of ML models is nearly 90%; that only 1 in 10 data science projects make it into production. The high failure rate is attributed mainly to the challenges associated with data, especially the lack of quality and consistency across data. This, in turn, erodes the accuracy, performance and robustness of the ML models. As a result, there is a growing business case to shift to a data-centric approach to ML and data science models.
This article delves into what data-centric ML models are, why it’s being preferred by data scientists and how a data-centric approach is applied to fashion ML.
Model-Centric and Data-Centric Approaches in Machine Learning
Typically, ML projects are functions of code and data. To improve the model, one must either tune the data or the code or both. The model-centric approach to ML keeps the dataset unchanged after standard pre-processing and focuses on tuning and rebuilding the model/ code itself to improve the model. Stress is laid on collecting larger volumes of data and iterating and optimizing the model to deal with the noise and inconsistencies in data.
The data-centric approach to ML keeps the algorithm/ model fixed and iterates the data quality. It focuses on strengthening data quality and consistency. It stresses on the need for consistent labelling across batches of data and consistently analyzing and removing noisy labels and data errors. So, these models invest more on data quality tools rather than excessively expending time and resources on collecting larger volumes of data. It seeks to iterate, augment and improve the dataset to improve the performance of the model.
As the popular expression in the tech space goes, ‘Garbage in, garbage out’ – that when you input garbage into model training, you will get garbage output/ results. The data-centric approach improves the results, performance and robustness of ML models through better quality and consistent data. The question to ask is not ‘what kind of data to input to train a useful model?’ but ‘what data kind is necessary to train, measure and maintain the success of the model?’.
Why the Shift to Data-Centric Approach to ML?
Let us take an example to understand why the shift is necessary.
A fashion retailer is building an ML model to classify fashion images based on the sleeve length. After having tuned the parameters, the model has an accuracy rate of 70.5%. Given the goal is to achieve 95% accuracy, what can be done, especially when the base model is optimized and already good? With the model-centric approach, there is almost no more scope for improvement. That is not the case with the data-centric approach to ML.
If a data-centric approach is applied, noisy labels, low-quality data, data processing errors and other inconsistencies would be identified and rectified. The model would be augmented with additional data where necessary or the dataset itself changed. This way, the 95% accuracy can be achieved and maintained when the model is deployed in the production environment.
Applying The Data-Centric Approach to Fashion ML
AI Studio is a no-code, DIY autoML platform that is designed for fashion users, without/with technical and programming knowledge, to solve their real-life challenges. With a proven set of classifiers and detectors and the simple drag-and-drop interface, domain experts can build robust fashion AI models, without writing a single line of code. Each user can create fashion AI models that are tailored to their needs and context, thus, improving operational efficiency and ushering cost savings.
This unique platform adopts a data-driven and iterative approach to building robust, high-performing, production-ready fashion AI models. In the pre-processing stage, several fashion-specific processes including adding bounding boxes, pose estimation, etc. are performed to ensure effective model training. The data processing results, analytics and insights enable users to improve the dataset, increase the accuracy rate and make the model robust.
At present, fashion users can create projects in image classification and collage generation. Several other project types such as image matching, garment extractor, background detection, pattern generator, commerce predictions, fashion object detection, etc. will soon be available.
How AI Studio Uses a Data-Centric Approach to Fashion ML?
Focuses on Data Quality
Given how data quality has a direct and significant impact on the model’s performance and accuracy, the AI Studio platform takes steps to ensure that only good quality images are used in training. Features such as bounding boxes, multi-category detectors, pose estimation, etc. are capable of ignoring the noise and focusing on only the relevant fashion objects/ parts of the image/ data input.
In the pre-processing stages, images are resized and poor quality images – too small, corrupted, pixelated, etc. – are automatically removed from the dataset. When images have too many clashing fashion products, the objects occupying the larger area are used for training.
The AI Studio platform allows fashion users to keep improving the model’s performance and accuracy through continuous iteration. All datasets can be iterated; users can upload multiple training and evaluation datasets within the same model and build multiple models within a single project.
The pre-processing insights and evaluation metrics empower users in iterating the datasets and strengthening the model. Based on these KPIs, the user can choose the most accurate one to deploy in production.
AI Studio is built on the understanding that model training is only one part of ML projects and extends significant attention to data processing and augmentation.
During the pre-processing stages, poor quality fashion images are automatically removed and there is an option for users to manually remove images based on the pre-processing results. Users can augment the dataset with more images based on these insights.
Upon successful training, the user can test and evaluate the model using evaluation datasets. Based on the key evaluation metrics, analytics and insights, the user can further finetune and augment the dataset to improve the model’s ability to build classifiers and detectors, thereby, boosting its precision and performance.
Since domain experts build models, the problem of labelling noise and labelling inconsistency is avoided. In AI Studio, models are trained using deep learning and custom datasets provided by the user. In other words, the domain expert provides the labels and a training dataset for each label, and the machine learns like a toddler would.
This helps make the models more accurate, high performing and effective in identification and classification of desired fashion objects/ attributes in the images when deployed in the production environment.
Integration Of Domain Knowledge
AI Studio is a DIY AutoML platform that allows fashion users including fashion brands, designers, merchandisers, retailers, marketplaces, e-commerce platforms and so on to build their own production-ready fashion AI models.
Owing to the integration of domain expertise, it is easier to ensure labelling consistency, get data quality assurance and build effective fashion models based on relevant insights and custom data.
Self-Learning Systems in The Product Architecture
Typically, the learning process in Machine learning is supervised. So, if the model is to be trained to identify attributes in a dataset, the programmer would create a detailed list of features, classifiers and detectors. This is laborious and its success depends on the abilities of the programmer.
In deep learning, however, the machine is provided with training data for each of the labels. Using this, the machine creates and learns feature sets for what is and is not a specific attribute. The user does not code each step but allows the system to self-learn. With each iteration, the model becomes more complex, accurate and robust in making predictions for that specific attribute. This is exactly how toddlers learn to identify things. Self-learning systems are an integral part of the AI Studio product architecture.
In the face of high failure rates of Machine Learning projects, the shift to the iterative, collaborative data-centric approach is imperative. With data-centric ML platforms with advanced capabilities like AI Studio, enable fashion users to easily adopt intelligent automation in their everyday functioning and drive real digital transformation while eliminating the risks associated with the model-centric approaches to ML.