Text mining is the process of analysing written language by discovering patterns and regularities in text. It is a powerful tool for extracting important information from texts, which can in turn be used for a range of purposes.
In this project, text mining was used to analyse song lyrics. Lyrics belong to an area of language that is well defined by rules and rhetorical devices. Because of their relatively structured nature, it was possible to apply various text mining algorithms to them. The algorithms used were n-gram modelling, unsupervised clustering using k-means, and supervised clustering using k-nearest neighbours.
The project resulted in two main functions. The first function generates part of a song verse, which the user completes by selecting a word from a predefined set of words generated by the function. Testing was done on three datasets of varying size. The results showed only small changes in the success rate of the verses across the differently sized corpora. The number of unsuccessful cases was lowest for the medium-sized dataset (around 250 documents per genre). A larger change was observed for semantically successful cases, where the success rate increased with data size. No significant change was observed for grammatically correct verses across the differently sized datasets.
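As a rough illustration of how an n-gram model can propose words to complete a partial verse, the sketch below builds a bigram model and returns the most frequent continuations of the last word. It is a minimal sketch under stated assumptions, not the project's implementation: the function names `build_bigram_model` and `candidate_words`, the toy corpus, and the choice of bigrams (n = 2) are all hypothetical.

```python
from collections import Counter, defaultdict


def build_bigram_model(lines):
    """Count, for each word, which words follow it in the corpus."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()
        for current, following in zip(tokens, tokens[1:]):
            model[current][following] += 1
    return model


def candidate_words(model, partial_verse, k=3):
    """Return up to k of the most frequent continuations of the last word."""
    tokens = partial_verse.lower().split()
    if not tokens:
        return []
    # .get avoids inserting unseen words into the defaultdict.
    followers = model.get(tokens[-1], Counter())
    return [word for word, _ in followers.most_common(k)]


# Toy corpus, invented for illustration only.
corpus = [
    "love me tender love me true",
    "love me do",
    "you know i love you",
]
model = build_bigram_model(corpus)

# The user would pick one of these candidates to complete the verse.
print(candidate_words(model, "I will always love"))  # ['me', 'you']
```

A larger corpus gives the model more observed continuations per word, which is one plausible reason a success rate could change with dataset size.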
The second function was based on an algorithm performing unsupervised and supervised clustering. The results showed that unsupervised clustering was unsuccessful in accurately classifying the given lyrics, indicating that supervised clustering is the more powerful tool for text categorisation.
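The supervised step can be pictured as a k-nearest-neighbour vote over labelled lyrics. The following is a minimal sketch, assuming bag-of-words vectors and cosine similarity. The feature representation, the names `vectorise`, `cosine` and `knn_classify`, and the toy training data are assumptions made for illustration and are not taken from the project.

```python
import math
from collections import Counter


def vectorise(text):
    """Represent a lyric as a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_classify(query, labelled, k=3):
    """Label a lyric by majority vote among its k most similar neighbours."""
    q = vectorise(query)
    ranked = sorted(
        labelled, key=lambda item: cosine(q, vectorise(item[0])), reverse=True
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy labelled lyrics, invented for illustration only.
training = [
    ("my truck my dog and the open road", "country"),
    ("whiskey on the porch by the old road", "country"),
    ("bass drop in the club all night", "dance"),
    ("dance all night under the club lights", "dance"),
]

print(knn_classify("driving my truck down the road", training, k=3))  # country
```

Unlike k-means, which groups documents without reference to genre labels, this vote uses the known labels of the neighbours directly.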