
Personalized News Recommendation System Based on the MIND Dataset


Haoran Sun

08/09/2025

Supervised by Irena Spasic; Moderated by Dr Daniel J. Finnegan

Kaggle data resource link: https://www.kaggle.com/datasets/arashnic/mind-news-dataset

Background I used to work as a product manager at an internet company, where I was responsible for the user reach system that recommended appropriate messages to users. When we first tried to introduce algorithmic recommendation into that system, the results were poor, and our experiments showed that rule-based recommendation was more effective. I hope to continue that research here by combining news content with user click behaviour to provide users with more accurate and customised information services. The MIND dataset provides a large collection of news text and user click logs, which makes it well suited to building an effective recommendation system. I learned Python, Java, and some algorithms at Cardiff. Although this is far from enough, I am willing to learn more because this area interests me.

Specific Aims and Objectives Aim: To construct and validate a personalised recommendation model based on news content and user click behaviour. Objectives: 1. Analyse the MIND dataset to extract news text features and user behaviour data; 2. Apply NLP techniques (e.g., TF-IDF or pre-trained models) to represent news text; 3. Design and train a recommendation model that incorporates user interests and news features; 4. Evaluate the effectiveness of the model using metrics such as click-through rate (CTR), precision, recall and nDCG.
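As an illustration of Objective 1, the sketch below parses one row of the click log into a user's history and labelled candidates. It assumes the tab-separated `behaviors.tsv` layout used by the MIND training and validation splits (impression ID, user ID, time, click history, impressions of the form `newsID-label`); the `Impression` class, the function name and the sample row are hypothetical and only illustrate the format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Impression:
    user_id: str
    history: List[str]                 # news IDs clicked before this impression
    candidates: List[Tuple[str, int]]  # (news_id, label); label 1 means clicked


def parse_behavior_line(line: str) -> Impression:
    """Parse one tab-separated row of behaviors.tsv (assumed five columns)."""
    _imp_id, user_id, _time, history, impressions = line.rstrip("\n").split("\t")
    candidates = []
    for item in impressions.split():
        # Each candidate looks like "N55689-1": news ID, hyphen, click label.
        news_id, label = item.rsplit("-", 1)
        candidates.append((news_id, int(label)))
    # An empty history column yields an empty list.
    return Impression(user_id, history.split(), candidates)


# Made-up sample row in the assumed format.
sample = "1\tU13740\t11/11/2019 9:05:58 AM\tN55189 N42782\tN55689-1 N35729-0"
imp = parse_behavior_line(sample)
print(imp.user_id, imp.history, imp.candidates)
```

Rows without click labels (for example, an unlabelled test split) would need a different branch; this sketch does not handle them.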

Proposed Method (Solution) 1. Data preprocessing: Extract user click logs and news text, then clean, tokenise and vectorise the text; 2. Feature extraction: Use NLP methods to convert news into semantic vectors and construct a user interest model (e.g., by aggregating the vector representations of the news articles a user has clicked); 3. Model design: Build a hybrid recommender system that combines collaborative filtering with content-based recommendation, or directly use a deep learning model to predict the probability that a user clicks on a news article; 4. Training strategy: Train the model with a loss function such as cross-entropy, and tune its hyperparameters through cross-validation.
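The content-based part of steps 1 and 2 can be sketched in a few lines. This is a minimal illustration rather than the proposed model: it assumes whitespace tokenisation, a smoothed TF-IDF weighting, a user profile formed by averaging the vectors of clicked articles, and cosine similarity as the ranking score. The headlines, news IDs and function names are invented for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return an L2-normalised sparse TF-IDF vector (dict) per document."""
    tokens = {nid: text.lower().split() for nid, text in docs.items()}
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokens.values() for term in set(toks))
    n = len(docs)
    vectors = {}
    for nid, toks in tokens.items():
        tf = Counter(toks)
        # Smoothed IDF, log((1 + n) / (1 + df)) + 1, is one common choice.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors[nid] = {t: w / norm for t, w in vec.items()}
    return vectors


def user_profile(history, vectors):
    """Average the vectors of the articles the user clicked."""
    profile = Counter()
    for nid in history:
        for term, weight in vectors[nid].items():
            profile[term] += weight / len(history)
    return dict(profile)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented headlines standing in for MIND news titles.
news = {
    "N1": "football team wins league title",
    "N2": "new football transfer rumours",
    "N3": "central bank raises interest rates",
    "N4": "football cup final tonight",
    "N5": "stock markets fall as interest rates rise",
}
vectors = tfidf_vectors(news)
profile = user_profile(["N1", "N2"], vectors)  # the user clicked N1 and N2
ranked = sorted(["N3", "N4", "N5"],
                key=lambda nid: cosine(profile, vectors[nid]),
                reverse=True)
print(ranked)
```

In a hybrid or deep learning model, a similarity score like this could be one input feature among several, with the click probability itself learned using the cross-entropy loss mentioned in step 4.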

Evaluation Overview 1. Evaluate the model using a training/test split or cross-validation; 2. The main evaluation metrics are click-through rate (CTR), precision, recall and nDCG, which will be used to measure the performance of the recommender system across different user groups and news categories.
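The ranking metrics named above can be computed per impression from the click labels listed in the order the model ranked the candidates. The sketch below assumes binary relevance (1 = clicked, 0 = not clicked) and the standard log2 discount for DCG; the function names and the example labels are illustrative only.

```python
import math


def dcg_at_k(labels, k):
    """Discounted cumulative gain over the top-k ranked labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(labels[:k]))


def ndcg_at_k(labels, k):
    """DCG divided by the DCG of the ideal (best possible) ordering."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0


def precision_at_k(labels, k):
    """Fraction of the top-k recommendations that were clicked."""
    return sum(labels[:k]) / k


def recall_at_k(labels, k):
    """Fraction of all clicked items that appear in the top k."""
    total = sum(labels)
    return sum(labels[:k]) / total if total else 0.0


# Click labels for one impression, ordered by the model's predicted score.
ranked_labels = [0, 1, 0, 1, 0]
print(ndcg_at_k(ranked_labels, 5))
print(precision_at_k(ranked_labels, 2))
print(recall_at_k(ranked_labels, 2))
```

Averaging these values over all impressions, and separately within each user group or news category, would give the breakdown described in the evaluation plan.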


Final Report (08/09/2025) [Zip Archive]
