Aligning generative transformer models with fine-tuning and supervision

Mohammed Fayaz Ansar Jelani


Supervised by Yuhua Li; Moderated by Yazmin Ibanez Garcia

This research project centres on the issues and possible solutions related to the utilization of generative transformer models (GTMs) for creating generative chatbots. Although GTMs offer dynamic and engaging conversations, their responses are difficult to control and they are exposed to malicious attacks. The objective of this research is to align GTMs with core business applications and prevent deviations from the intended functionality. The research goals encompass fine-tuning GTMs with business-specific data and updating supervision models to counter prompt-based attacks. The findings demonstrate that fine-tuning with incremental prompts produces favourable outcomes, albeit at a higher cost and with less efficiency than using pre-trained models. The study also assesses the effectiveness of inbound and outbound supervision models in countering jailbreak and data leakage attacks. The outbound supervision model is successful in mitigating both kinds of attack, whereas the inbound supervision model proves to be ineffective. However, the outbound supervision model has a limitation: it can interfere with the chatbot's intended functionality. Future work involves further refining the outbound supervision prompts and exploring the possibility of fine-tuning instruct pre-trained models as an alternative to base models. This research offers recommendations for organizations launching new chatbots in emerging areas. In summary, the aim of this research is to bridge the gap between the potential of GTMs and the practical challenges they present, enabling organizations to deploy chatbots that are effective and resilient against malicious attacks.
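To make the distinction between inbound and outbound supervision concrete, the following is a minimal sketch and not the project's implementation. It assumes that an inbound check screens the user's prompt before the chatbot sees it and that an outbound check screens the chatbot's reply before the user sees it. Every name in it (`supervised_reply`, `toy_chatbot`, `toy_outbound_ok`, `SECRET`, `REFUSAL`) is hypothetical, and the simple string checks stand in for whatever supervision models the report actually uses.

```python
from typing import Callable

# Hypothetical fallback message returned when a supervision check fails.
REFUSAL = "Sorry, I can't help with that request."


def supervised_reply(
    user_prompt: str,
    chatbot: Callable[[str], str],
    inbound_ok: Callable[[str], bool] = lambda prompt: True,
    outbound_ok: Callable[[str], bool] = lambda reply: True,
) -> str:
    """Wrap a chatbot call with optional inbound and outbound checks."""
    # Inbound supervision (assumed meaning): screen the prompt first.
    if not inbound_ok(user_prompt):
        return REFUSAL
    reply = chatbot(user_prompt)
    # Outbound supervision (assumed meaning): screen the reply before returning it.
    if not outbound_ok(reply):
        return REFUSAL
    return reply


# Toy stand-ins for illustration only; a real system would call language models.
SECRET = "internal-api-key-123"


def toy_chatbot(prompt: str) -> str:
    # Simulates a model that leaks data when given a jailbreak-style prompt.
    if "ignore previous instructions" in prompt.lower():
        return f"Sure, the key is {SECRET}"
    return "Our store opens at 9am."


def toy_outbound_ok(reply: str) -> bool:
    # Blocks any reply that contains the protected value.
    return SECRET not in reply
```

In this toy setup, an ordinary question passes through unchanged, while a jailbreak-style prompt that makes `toy_chatbot` leak `SECRET` is caught by the outbound check and replaced with `REFUSAL`. An outbound check that is too strict would also block legitimate replies, which is one way to picture the interference with intended functionality that the abstract mentions.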

Final Report (11/09/2023) [Zip Archive]

Publication Form