Identifying narrative text on Reddit

Adriano Sole


Supervised by Steven Schockaert; Moderated by Philipp Reinecke

Neural network models for natural language processing are typically pre-trained on large text collections. This allows these models to learn world knowledge without the need for an explicit supervision signal. However, the kind of knowledge they can learn in this way depends crucially on the type of text collection that is used. Wikipedia is a common choice: by pre-training on encyclopedic text, models can acquire a lot of factual knowledge about the world. On the other hand, training models on narrative text (e.g. books or movie scripts) can be a better choice if learning commonsense knowledge is the main goal.

In recent years, the social media site Reddit has increasingly been used for training natural language processing models. However, Reddit contains documents covering a broad range of genres, including expository, narrative and argumentative text. The aim of this project is to develop a method for identifying narrative documents in a large collection of Reddit posts.

Initial Plan (07/02/2021) [Zip Archive]

Final Report (12/05/2021) [Zip Archive]

Publication Form