This project experiments with a technique inspired by brute-force attack methods from cybersecurity, applied to datasets and training for binary image classification. It implements an AutoML pipeline that trains and evaluates an exhaustive set of model-dataset pairings to determine the best-performing ones, and compares the results with published literature and traditional approaches to ML development.
The primary goals of this project are to learn machine learning fundamentals and to give other AI developers and researchers an example of how a systematic approach such as brute force can be used to study the behaviours that models exhibit across dataset combinations, along with the potential problems that such a pipeline may uncover. The pipeline was designed to be developer-friendly: users configure it, including the brute-force parameters, in a single Python script.
The proposed pipeline takes user-defined models and datasets for classification and can be repurposed for any other binary image classification problem. Several metrics, such as accuracy and F1-score, are used to compare the performance of each model it generates. The resulting performance data is then used to look for potential gains from specific pairings of dataset combinations and models, presented as aggregated metric-specific charts and a heatmap pivot table of dataset-model pairings.
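As a rough illustration of the comparison step, the sketch below computes accuracy and F1-score for binary labels and arranges one metric into a pivot-style table keyed by dataset combination and model. It uses only the Python standard library. The function names, the record layout and the `model_a` / `ds_1+ds_2` labels are assumptions made for this example, and the labels are toy values rather than results from the project.

```python
from collections import defaultdict


def binary_metrics(y_true, y_pred):
    """Compute accuracy and F1-score for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "f1": f1}


def pivot_by_pairing(records, metric):
    """Arrange one metric as {dataset_combination: {model: value}}."""
    table = defaultdict(dict)
    for rec in records:
        table[rec["datasets"]][rec["model"]] = rec[metric]
    return dict(table)


# Toy labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
scores = binary_metrics(y_true, y_pred)

# One hypothetical training run; a real pipeline would append one record per run.
records = [{"model": "model_a", "datasets": "ds_1+ds_2", **scores}]
table = pivot_by_pairing(records, "f1")
```

A nested mapping of this shape corresponds to the rows and columns of a heatmap, with dataset combinations on one axis and models on the other.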
The pipeline consists of several stages: pre-processing, parameter setup, batch generator/dataset creation, search space definition, batch training, evaluation, and aggregation of metrics and other relevant data across the training runs of all models into a single large tabulated dataframe. This dataframe is then aggregated to generate readable performance charts for all models.
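The search space definition stage can be sketched as follows. This is a minimal example that assumes the search space is every model paired with every non-empty combination of the available datasets; the actual pipeline may restrict or extend this. The function name `build_search_space` and the model and dataset names are hypothetical.

```python
from itertools import combinations, product


def build_search_space(models, datasets):
    """Pair every model with every non-empty combination of datasets."""
    dataset_combos = [
        combo
        for size in range(1, len(datasets) + 1)
        for combo in combinations(datasets, size)
    ]
    return list(product(models, dataset_combos))


# Hypothetical model and dataset names, for illustration only.
models = ["model_a", "model_b"]
datasets = ["ds_1", "ds_2", "ds_3"]

search_space = build_search_space(models, datasets)
print(len(search_space))  # 2 models x 7 non-empty dataset combinations = 14
```

Because the number of dataset combinations grows as 2^n - 1 for n datasets, the number of training runs grows quickly, which is the main cost of an exhaustive approach of this kind.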
Observations from the runs, reasoning guided by literature and tutorials, and reflective learning commentary are included in the paper.
N.B.: This is my very first endeavour into machine learning development and represents a fairly challenging undertaking as a step in my self-development. Domain experts may point out flaws in this approach; it is based on my own intuition from previous tech experience and does not aim to exceed the results that more seasoned data scientists and engineers may achieve using more conventional methods, including feature selection and custom architectures. The flaws of this approach have been researched and discussed to the best of my knowledge.