Socially-Engineered Phishing E-mail Detection, Effect of Feature Hashing on Model Accuracy

Fatimah Aloraini


Supervised by Neetesh Saxena; Moderated by Michael Daley

The Internet’s rapid growth has profoundly changed the online user experience. Although the Internet makes it convenient for organisations and individuals to exchange information, it also gives cybercriminals a medium for malicious behaviour. Social engineering attacks, particularly phishing, are among the most common security threats today. Although phishing attacks take different forms, e-mail remains the most popular vector for launching them because of its central role in digital communication, irrespective of the field. Phishing e-mails hidden among billions of legitimate e-mails have threatened global security over the last decade. Phishing can affect both individuals and organisations, and the consequences of a successful phishing e-mail attack can bring even well-protected organisations to their knees. Despite the increase in anti-phishing solutions, recent statistics from the Anti-Phishing Working Group show that the problem of phishing is far from solved. Analysis of the e-mail body (i.e. the message) is important in phishing e-mail detection because the header is usually hidden from users, so the body is where most social engineering tricks are implemented. Nevertheless, content-based classification comes with the challenge of highly sparse representations of instances (i.e. e-mails), which can become unmanageable with large corpora. Thus, in this study, we present a content-based machine learning model that distinguishes phishing e-mails from legitimate ones and overcomes the challenge of sparse representations to a significant extent. The model uses the hashing trick instead of traditional dictionary-based methods to generate vector representations of e-mail bodies.
A comparison of five supervised machine learning algorithms (Support Vector Machine, Decision Tree, Naive Bayes, Random Forest, and Logistic Regression) is performed to find the best model for the formulated problem. The proposed approach demonstrates that feature hashing improves the feature extraction process in terms of feature vector size (only 8% of the dataset’s original size) and extraction time (twice as fast) compared to previous content-based detection models, without a significant effect on detection accuracy.
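
To make the hashing trick mentioned above concrete, the following is a minimal sketch of how an e-mail body could be mapped to a fixed-length vector without building a dictionary. It is not taken from the report: the function name `hash_features`, the parameter `n_features`, the tokenisation rule, and the use of MD5 as the hash function are all assumptions chosen for illustration.

```python
import hashlib
import re


def hash_features(text, n_features=16):
    """Map a text to a fixed-length count vector via the hashing trick.

    Illustrative only: the name, the tokeniser, and the MD5 hash are
    assumptions, not the implementation used in the report.
    """
    vector = [0] * n_features
    # Simple lowercase word tokeniser (an assumption for this sketch).
    for token in re.findall(r"[a-z0-9']+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # The first 8 bytes of the digest choose the bucket index.
        index = int.from_bytes(digest[:8], "big") % n_features
        # One further bit chooses a sign, so colliding tokens tend to
        # cancel rather than always accumulate.
        sign = 1 if digest[8] & 1 == 0 else -1
        vector[index] += sign
    return vector


if __name__ == "__main__":
    body = "Verify your account now to avoid suspension"
    print(hash_features(body))
```

The vector length is fixed by `n_features` regardless of how many distinct words the corpus contains, which is what removes the need to store a vocabulary. The trade-off is that different tokens can collide in the same bucket, so `n_features` would normally be set far larger than in this toy example.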

Final Report (18/09/2020) [Zip Archive]
