Quantifying the different COVID-19 variants present in wastewater within South Wales

Arlyn Miles


Supervised by Bailin Deng; Moderated by Helen R Phillips

The aim of this project is to identify and quantify the different variants of COVID-19 present in wastewater samples collected within South Wales.

I have worked with Professor Peter Kille (Director of Technology and Bio-Initiatives at the School of Biosciences) as my client, collaborating with the ongoing research into this subject. The data I have used is sequenced RNA data from wastewater and from individual samples held in online databases. I have created an optimised pipeline to process this sequenced data and identify the different COVID-19 variants present. I have also developed a command-line interface tool that works with this pipeline to configure it and to generate reports from the results.

The pipeline is intended to run on the Cardiff University School of Biosciences Trinity Cluster, a high-performance computing cluster well suited to bioinformatics pipelines that process large datasets such as the one I am working with. In this project I have therefore made use of the processing power available and kept computational efficiency in mind when developing the pipeline.

New strains of COVID-19 are spreading throughout the population, and the virus will continue to mutate. New variants can have significant effects: mutations on the spike protein of variant 501.V2, also known as the “South African variant”, may increase its infectivity. It is therefore important to monitor the different COVID-19 variants present within local communities.

RNA of COVID-19 is present in faecal matter and can therefore be detected within wastewater. It is thus possible to quantify the variants circulating within the wider South Wales population by sequencing viral RNA from community wastewater samples.

For each variant detected in a wastewater sample, the pipeline produces an allelic frequency indicating how common that variant is. These frequencies can be used in further research to estimate the number of individuals infected with a given strain containing many variants. Further statistics about the data, such as sequencing coverage across the COVID-19 genome, are also provided; these are key for researchers to verify the accuracy of the results.
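As a rough illustration of the two quantities mentioned above, the sketch below computes an allelic frequency (the fraction of reads at a genome position that carry a mutation) and a simple coverage statistic (the fraction of positions sequenced to at least a minimum depth). It is not the project's actual code: the function names, the `min_depth` threshold, the site labels and all read counts are hypothetical values chosen only for the example.

```python
from typing import Dict, List, Tuple


def allele_frequency(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads at one genome position that carry the mutation."""
    if total_reads == 0:
        # No reads cover this position, so no frequency can be estimated.
        return 0.0
    return alt_reads / total_reads


def coverage_breadth(depths: List[int], min_depth: int = 10) -> float:
    """Fraction of genome positions covered by at least min_depth reads."""
    if not depths:
        return 0.0
    covered = sum(1 for depth in depths if depth >= min_depth)
    return covered / len(depths)


# Hypothetical (mutation reads, total reads) counts at two sites.
site_counts: Dict[str, Tuple[int, int]] = {
    "site_A": (30, 120),
    "site_B": (5, 100),
}
frequencies = {
    name: allele_frequency(alt, total)
    for name, (alt, total) in site_counts.items()
}

# Hypothetical per-position read depths along a short stretch of genome.
depths = [0, 4, 12, 50, 80, 9, 15, 20]
breadth = coverage_breadth(depths, min_depth=10)

for name, freq in frequencies.items():
    print(f"{name}: allelic frequency {freq:.2f}")
print(f"Positions with depth >= 10: {breadth:.1%}")
```

A low coverage figure at a position is a signal that the allelic frequency reported there rests on few reads and should be treated with caution.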

This project shows great promise for aiding the continual monitoring of COVID-19 infections and mutations at a cost-effective population-wide scale.

Initial Plan (07/02/2021) [Zip Archive]

Final Report (28/05/2021) [Zip Archive]

Publication Form