Abusive language detection against immigrants and women

Ross Singleton


Supervised by Luis Espinosa-Anke; Moderated by David J Humphreys

In this project, a student will explore computational approaches to modeling and detecting biased and abusive language in social media. There are two broad topics: (1) offensive language detection, which targets behavior that occurs when individuals take advantage of the perceived anonymity of computer-mediated communication and engage in conduct that many of them would not consider in real life [1]; and (2) hate speech detection, where hate speech is commonly defined as any communication that disparages a person or a group on the basis of some characteristic such as race, color, ethnicity, gender, sexual orientation, nationality, or religion [2].

Both types of abusive language are pervasive in social media. Projects will be carried out in a controlled environment: they will use the datasets provided by two current data science competitions (see references). The main goals of a project within this proposal are to build a working system (i.e., one whose output can be evaluated with the official scorer script provided by the task), to describe the intuitions behind the design of the model, and to explain how those intuitions made it into the final model and code. The resulting system can then be compared, in a real-life scenario, against the baseline system provided by the organizers of each competition and against systems submitted by other research groups and companies.

A student may choose to develop a system for either of these competitions, or for both.
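As a sketch of what a starting-point system might look like, the following is a minimal bag-of-words Naive Bayes classifier in pure Python. The toy examples, the binary labels (0 = not abusive, 1 = abusive), and the function names are illustrative assumptions only; neither competition prescribes this model, and a real submission would train on the official datasets and be evaluated with each task's scorer script.

```python
import math
from collections import Counter

def tokenize(text):
    """Whitespace tokenization; a real system would use a proper tokenizer."""
    return text.lower().split()

def train(examples):
    """examples: list of (text, label) pairs with labels 0/1; returns a model dict."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return {"word_counts": word_counts, "class_counts": class_counts, "vocab": vocab}

def predict(model, text):
    """Return the label with the higher log-probability, using Laplace smoothing."""
    total = sum(model["class_counts"].values())
    scores = {}
    for label in (0, 1):
        # Log prior for the class.
        score = math.log(model["class_counts"][label] / total)
        denom = sum(model["word_counts"][label].values()) + len(model["vocab"])
        # Add the smoothed log likelihood of each token.
        for tok in tokenize(text):
            score += math.log((model["word_counts"][label][tok] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training data, purely illustrative.
train_data = [
    ("you are great", 0),
    ("have a nice day", 0),
    ("you are awful trash", 1),
    ("go away you idiot", 1),
]
model = train(train_data)
```

Even a weak lexical baseline like this is useful as a point of comparison: it gives the student a working end-to-end pipeline whose predictions can be fed to the official scorer before more sophisticated models are attempted.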

[1] https://competitions.codalab.org/competitions/20011
[2] https://competitions.codalab.org/competitions/19935

Initial Plan (31/01/2020) [Zip Archive]

Final Report (14/05/2020) [Zip Archive]
