Semantic Search using Large Language Models [MULTIPLE PROJECTS]

Yoshinobu Katayama


Supervised by Alun D Preece; Moderated by Paul L Rosin

This project is hosted by the Security Crime and Intelligence Innovation Institute. We are interested in efficient and effective ways of discovering data through semantic search and matching. Semantic matching takes into account the meaning of information rather than just its syntactic structure: semantically, a "river bank" is very different from a "money bank", even though syntactically both include the word "bank". Modern natural language processing (NLP) uses large language models (LLMs) such as BERT and GPT to determine semantic similarity. There are multiple possible projects here, including:

-- Using semantic search to find passages in a document, or posts on social media, by similarity to an example passage or post;

-- Using semantic matching to perform location-based searches where locations are expressed in natural language (potentially in multiple languages, e.g., Caerdydd vs Cardiff);

-- Using semantic matching for structured data such as JSON or XML.
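To illustrate the core idea, the sketch below compares phrases by cosine similarity between embedding vectors. The vectors here are small hand-crafted toy values chosen purely for illustration; in a real system they would be produced by an LLM encoder such as BERT, which maps semantically similar text to nearby points in a high-dimensional space.

```python
import math

# Toy embedding vectors (hand-crafted for illustration only; a real
# system would obtain these from an LLM encoder such as BERT).
embeddings = {
    "river bank":  [0.9, 0.1, 0.8],
    "money bank":  [0.1, 0.9, 0.2],
    "stream edge": [0.8, 0.2, 0.7],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# With suitable embeddings, "river bank" scores closer to "stream edge"
# than to "money bank", despite sharing the word "bank" with the latter.
sim_stream = cosine_similarity(embeddings["river bank"], embeddings["stream edge"])
sim_money = cosine_similarity(embeddings["river bank"], embeddings["money bank"])
assert sim_stream > sim_money
```

Semantic search then reduces to embedding a query and ranking candidate passages by this similarity score, which is exactly what purely syntactic keyword matching cannot do.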

The project will initially make use of existing NLP methods and LLMs, but there is potential (in advanced variants of the project) to develop new methods, e.g., by fine-tuning existing LLMs.

Students will be supported by members of the Security Crime and Intelligence Innovation Institute team in the sbarc|spark building, and will be able to work as peers alongside other students doing variants of this topic.

This project is suitable for students who have studied machine learning and NLP; however, the topic is also open to conversion MSc students who are willing to learn NLP methods.

Final Report (07/09/2023) [Zip Archive]

Publication Form