A comprehensive evaluation of artificially intelligent learning in the detection of phishing emails

Hossein Ramezanian


Supervised by Neetesh Saxena; Moderated by Shancang Li

Phishing email attacks are potent and dangerous by design, and many companies, industries and individuals have fallen victim to them. Advances have been made to combat these attacks; however, current technology still fails to stop a significant portion of them. This may be for many reasons, but one is the nature of a well-crafted phishing email, which mimics the textual behaviour a regular client or friend would follow. Many researchers have tried to apply AI to this problem, with research focused either on URL-based information such as headers and email metadata or on the textual body of emails. Few studies have carried out a comprehensive analysis of various AI algorithms, techniques and methods for the textual part of spam emails. This project therefore set out to study the textual patterns attackers tend to follow in order to correctly classify and detect phishing emails. Ultimately, its scope was set to a comprehensive experiment and evaluation of ML and DL algorithms, feature extraction methods and AI techniques for detecting phishing emails. The Enron data set was used to train and evaluate the algorithms, with a dedicated Python program providing the pipeline for the solution. DT, LR, NB and SVM classifiers were each trained on three feature extraction methods: Count Vectorizer, character-level TF-IDF and unigram-level TF-IDF. These algorithms were trained with their default parameters, with cross-validation, and with hyperparameter tuning. Furthermore, CNN, LSTM, GRU, BI-RNN and RCNN models were trained against two different pre-trained word embeddings, with fine-tuning applied to find the best epoch sizes. The best performance among the ML classifiers was observed for SVM and LR, each scoring an average F1 of roughly 96%. Cross-validated training showed that DT was overfitting, which could possibly be addressed by applying pruning techniques.
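The ML side of the pipeline described above can be sketched with scikit-learn. This is an illustrative sketch, not the project's actual code: the six-message inline corpus is a placeholder standing in for the Enron data set, and the exact vectorizer settings are assumptions.

```python
# Sketch of one ML configuration from the evaluation: unigram-level TF-IDF
# features feeding a linear SVM, scored with cross-validated F1.
# The toy corpus below is a placeholder for the Enron data set.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "Your account is suspended, verify your password now",
    "Urgent: click this link to claim your prize",
    "Security alert: confirm your bank details immediately",
    "Meeting moved to 3pm, see attached agenda",
    "Thanks for the report, the numbers look good",
    "Lunch on Friday? The usual place works for me",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = phishing, 0 = legitimate

# Unigram-level TF-IDF; swap analyzer="char" for the character-level variant.
pipe = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 1)),
    LinearSVC(),
)
scores = cross_val_score(pipe, texts, labels, cv=3, scoring="f1")
print(scores.mean())
```

Swapping the final estimator for `DecisionTreeClassifier`, `LogisticRegression` or `MultinomialNB` reproduces the other classifiers in the comparison.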
Hyperparameter tuning boosted the performance of the classifiers, most notably NB. The best performance in the DL phase of the project came from the CNN, with an F1 of roughly 98.5%, although all of the algorithms used performed well. Using a larger pre-trained word embedding as the feature representation resulted in performance gains of around 4%. Ultimately, the experiments concluded that DL algorithms outperform ML algorithms, although SVM and LR could serve as quick alternatives to deep learning algorithms.
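The hyperparameter tuning step mentioned above can be illustrated with a small grid search over the NB smoothing parameter. Again a sketch under assumptions: the corpus and the grid values are placeholders, not the project's actual data or search space.

```python
# Sketch of hyperparameter tuning for NB: grid-search the Laplace smoothing
# parameter alpha, selecting by cross-validated F1.
# The toy corpus stands in for the Enron data set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = [
    "Verify your password or your account will be closed",
    "You won a prize, click here to claim it today",
    "Unusual sign-in detected, confirm your identity now",
    "Draft minutes from yesterday's meeting attached",
    "Can you review the quarterly figures by Monday",
    "Happy to reschedule our call to next week",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = phishing, 0 = legitimate

pipe = Pipeline([
    ("vec", CountVectorizer()),  # Count Vectorizer features
    ("nb", MultinomialNB()),
])
grid = GridSearchCV(
    pipe,
    param_grid={"nb__alpha": [0.01, 0.1, 1.0]},  # placeholder grid
    cv=3,
    scoring="f1",
)
grid.fit(texts, labels)
print(grid.best_params_)
```

The same `GridSearchCV` wrapper applies unchanged to the DT, LR and SVM pipelines; only the `param_grid` keys differ.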

Final Report (19/10/2022) [Zip Archive]

Publication Form