Assessing Compliance of Web Pages using Machine Learning

Christopher Green


Supervised by Irena Spasic; Moderated by Helen R Phillips

The project will deliver an application that crawls a given set of web domains (100+) to find pages displaying compliance-related data, and categorises each such page as compliant or non-compliant using a combination of machine learning and rule-based approaches. Features used for classification will be extracted from the web pages using natural language processing (NLP); in particular, named entity recognition and basic information extraction are expected to be used. The main concepts in the web pages will be formally modelled in a small ontology to support the semantic elements of the NLP.
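As a rough illustration of how a rule-based check and a learned classifier might be combined on features extracted from page text, the following Python sketch may help. Everything in it is an assumption made for illustration and is not taken from the project: the regular expressions stand in for named entity recognition, the rule ("a compliant page states at least one date and one amount") is invented, and `toy_model` is a placeholder for a trained classifier. Crawling and the ontology are omitted.

```python
import re
from typing import Callable, Dict

# Hypothetical patterns standing in for NER / information extraction.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
AMOUNT_RE = re.compile(r"£\s?\d[\d,]*(?:\.\d+)?")


def extract_features(text: str) -> Dict[str, float]:
    """Turn page text into a small feature dictionary."""
    return {
        "n_dates": float(len(DATE_RE.findall(text))),
        "n_amounts": float(len(AMOUNT_RE.findall(text))),
        "n_tokens": float(len(text.split())),
    }


def rule_check(features: Dict[str, float]) -> bool:
    """Hypothetical hard rule: at least one date and one amount must appear."""
    return features["n_dates"] >= 1 and features["n_amounts"] >= 1


def toy_model(features: Dict[str, float]) -> float:
    """Placeholder for a trained classifier; returns a score in [0, 1]."""
    hits = features["n_dates"] + features["n_amounts"]
    return min(1.0, hits / 4.0)


def classify(
    text: str,
    model: Callable[[Dict[str, float]], float] = toy_model,
    threshold: float = 0.5,
) -> str:
    """Apply the rule first, then fall back on the model score."""
    features = extract_features(text)
    if not rule_check(features):
        return "non-compliant"
    return "compliant" if model(features) >= threshold else "non-compliant"


if __name__ == "__main__":
    page = "Published 01/04/2016: grant of £12,500 awarded on 15/05/2016, with £3,000 retained."
    print(classify(page))
    print(classify("Welcome to our home page."))
```

In this arrangement the rule acts as a gate, so a page that lacks the required items is rejected without consulting the model. Other combinations, such as feeding the rule outcome to the classifier as one more feature, would be equally plausible.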

The potential benefit of such a system is a dramatic reduction in the manual workload of ensuring that disparate organisations display the required data to the required standard.

Final Report (23/11/2016) [Zip Archive]

Publication Form