
Data pre-processing techniques and tools for predictive modelling using unstructured inputs

  • Przemyslaw Patryk Maslowski

Student thesis: Master's thesis

Abstract

Data is a crucial factor within machine learning, as most neural networks and machine learning models are data-driven. A trained neural network can be used to make predictions on new data that the model has not seen but that follows the trained patterns. The performance of the predictive model can vary based on the data used during training. Multiple metrics are produced after a model is trained to evaluate model performance. However, it is difficult to get an intuitive measurement that indicates whether the data pre-processing of a model has improved or not. Therefore, a constructive performance indicator tool that can be used to intuitively measure the performance of pre-processing mechanisms for a given model has been developed through multiple experiments with 32 datasets. The experiments are set up by collecting multiple unstructured datasets, which are subsequently converted into structured datasets and then evaluated by their modelling performance. The experiment results are used to evaluate the importance of each metric and to prioritise the metrics via weights for contextualising the pre-processing experience within the constructivist paradigm.

Furthermore, a set of tools has been developed throughout the project to improve the efficiency of machine learning experiments. The developed tools are part of the main software, which is named the pre-processing assistant. The pre-processing assistant has been published to the public, and it can be used for preparing, processing, and analysing data. The software tools allow users to manipulate datasets and generate Python scripts to train a predictive model. The TensorFlow framework and its machine-learning algorithms have also been utilised to develop Python scripts for training and predicting datasets. The software has been used to effectively carry out the experiments which have helped to configure the performance indicator tool.

In the end, the most important metrics have been discovered through various experiments. The experiments consist of training the model with and without data pre-processing techniques. The increase in each metric has been used to discover significant metrics. The metrics which improve frequently are estimated to be more critical and have been assigned a higher weight. The performance indicator has been configured based on the final experiment results, and it can be used by others to measure the performance of a predictive model.
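
The weighting idea described in the abstract (metrics that improve more often under pre-processing receive a higher weight) can be illustrated with a minimal sketch. This is not the thesis's actual indicator: the function names `improvement_weights` and `indicator_score`, the metric names, the normalisation to weights that sum to 1, the assumption that a higher value is better for every metric, and all numbers below are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical record for one experiment:
# metric name -> (value without pre-processing, value with pre-processing).
# Assumption: higher is better for every metric (a loss would need to be negated).
Experiment = Dict[str, Tuple[float, float]]


def improvement_weights(experiments: List[Experiment]) -> Dict[str, float]:
    """Weight each metric by how often it improved across experiments.

    The counts are normalised so the weights sum to 1. This normalisation
    is an assumption of this sketch, not a detail taken from the thesis.
    """
    counts: Dict[str, int] = {}
    for exp in experiments:
        for name, (before, after) in exp.items():
            counts.setdefault(name, 0)
            if after > before:
                counts[name] += 1
    total = sum(counts.values())
    if total == 0:
        # No metric ever improved: fall back to uniform weights.
        return {name: 1.0 / len(counts) for name in counts}
    return {name: count / total for name, count in counts.items()}


def indicator_score(exp: Experiment, weights: Dict[str, float]) -> float:
    """Weighted sum of per-metric changes for a single experiment."""
    return sum(
        weights[name] * (after - before)
        for name, (before, after) in exp.items()
        if name in weights
    )


# Made-up illustrative values, not results from the thesis.
example_experiments: List[Experiment] = [
    {"accuracy": (0.70, 0.80), "precision": (0.60, 0.65), "recall": (0.50, 0.45)},
    {"accuracy": (0.75, 0.78), "precision": (0.70, 0.68), "recall": (0.55, 0.60)},
    {"accuracy": (0.60, 0.72), "precision": (0.58, 0.66), "recall": (0.40, 0.40)},
]
example_weights = improvement_weights(example_experiments)
example_score = indicator_score(example_experiments[0], example_weights)
```

In this sketch a positive score suggests that pre-processing helped on the metrics that most often respond to it, while a negative score suggests the opposite. The real tool may combine metrics differently.
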
Date of Award: Jul 2020
Original language: English
Awarding Institution
  • University of Bedfordshire
Supervisor: Renxi Qiu (Supervisor) & Haiming Liu (Second supervisor)

Keywords

  • Data Pre-Processing
  • Machine Learning
  • Supervised Learning
  • Deep Learning
  • Data Analysis
  • Subject Categories::G760 Machine Learning
