Comparative Analysis of Tree-Based Machine Learning Models and Neural Networks for Malware Prediction

Kriti Shewaramani


Supervised by Yuhua Li; Moderated by Sandy Gould

Malware attacks occur every day around the world. These attacks involve malicious software that can lock up essential files, flood users with ads, or redirect them to malicious websites, with consequences ranging from data theft to the destruction of entire systems or devices. Cybercriminals use different types of malware, such as trojans, ransomware, spyware, and worms, to infect the machines of individuals and organizations. Machine learning is a well-established technology, and its concepts can be applied to malware detection in order to detect and prevent malware activity efficiently. The project aims to apply machine learning to predict a computer's probability of being infected by various families of malware, based on different properties of that machine, using three types of machine learning model: LightGBM, XGBoost, and a neural network. These models are trained on a dataset published by Microsoft on Kaggle, with over 80 features drawn from Windows Defender reports. Data pre-processing, feature engineering, and exploratory data analysis were carried out before the models were implemented. Once implemented, the models find similarities and patterns in the data to perform classification. The Python programming language and Jupyter Notebook were used throughout the project.

Initial Plan (07/02/2022) [Zip Archive]

Final Report (19/05/2022) [Zip Archive]

Publication Form