This project presents a machine learning-based platform to predict nanoparticle biodistribution in cancer treatment, with a focus on tumour delivery efficiency. It leverages an ensemble of neural networks and gradient-boosted decision trees (XGBoost) to model how design features—such as size, surface charge, and coating—affect delivery outcomes in preclinical models.
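As a rough illustration of the ensemble idea, the sketch below blends the predictions of two regressors with a weighted average. The two model functions, their coefficients, the feature names (`size_nm`, `zeta_mv`, `peg_coated`), and the equal blend weights are all hypothetical stand-ins chosen for illustration; they are not the project's trained neural network or XGBoost model, and the project may combine its models differently.

```python
from typing import Callable, Dict, Sequence

Design = Dict[str, float]
Model = Callable[[Design], float]


def nn_stub(design: Design) -> float:
    """Hypothetical stand-in for the neural network (made-up coefficients)."""
    return 0.9 - 0.002 * design["size_nm"] + 0.01 * design["zeta_mv"]


def gbt_stub(design: Design) -> float:
    """Hypothetical stand-in for the XGBoost model (made-up coefficients)."""
    return 0.7 - 0.001 * design["size_nm"] + 0.2 * design["peg_coated"]


def ensemble_predict(
    design: Design, models: Sequence[Model], weights: Sequence[float]
) -> float:
    """Return the weighted average of the individual model predictions."""
    if not models or len(models) != len(weights):
        raise ValueError("need one weight per model")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * m(design) for m, w in zip(models, weights)) / total


# Illustrative design only; the values are not taken from any dataset.
design = {"size_nm": 100.0, "zeta_mv": -10.0, "peg_coated": 1.0}
prediction = ensemble_predict(design, [nn_stub, gbt_stub], [0.5, 0.5])
print(f"Blended %ID prediction: {prediction:.2f}")
```

In practice the blend weights would typically be chosen on a validation set rather than fixed by hand.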
A key innovation is the integration of SciBERT text embeddings to extract semantic features from nanoparticle descriptions, enriching the input space beyond structured numerical data. Interpretability is achieved through SHAP analysis, providing clear insight into how each input feature influences predictions.
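SHAP attributions are built on Shapley values, so a small worked example may help show what "how each input feature influences predictions" means. The sketch below computes exact Shapley values by enumerating feature orderings for a toy linear model. It does not use the `shap` library, and the model, feature names, and baseline design are assumptions made up for illustration; real SHAP explainers approximate these values for larger models.

```python
from itertools import permutations
from typing import Callable, Dict

Features = Dict[str, float]


def toy_model(x: Features) -> float:
    """Hypothetical predictor of %ID with made-up coefficients."""
    return 1.0 - 0.005 * x["size_nm"] + 0.02 * x["zeta_mv"] + 0.4 * x["peg_coated"]


def shapley_values(
    model: Callable[[Features], float], instance: Features, baseline: Features
) -> Dict[str, float]:
    """Exact Shapley values: average marginal contribution over all orderings.

    Features not yet "switched on" keep their baseline value.
    """
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for name in order:
            current[name] = instance[name]  # switch this feature on
            value = model(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


# Illustrative inputs only.
instance = {"size_nm": 50.0, "zeta_mv": -5.0, "peg_coated": 1.0}
baseline = {"size_nm": 100.0, "zeta_mv": 0.0, "peg_coated": 0.0}
attributions = shapley_values(toy_model, instance, baseline)
for name, value in attributions.items():
    print(f"{name}: {value:+.3f}")
```

The attributions sum to the difference between the prediction for the instance and the prediction for the baseline, which is the property that makes per-sample SHAP plots additive. Enumerating orderings scales factorially with the number of features, so this approach only suits very small feature sets.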
To promote global collaboration without compromising data privacy, the platform implements a privacy-preserving model update mechanism, enabling users to fine-tune the model on their own data and submit only weight updates (not raw data) for aggregated improvement.
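One way such a weight-update exchange could work is sketched below: each contributor sends only the difference between its fine-tuned weights and the shared weights, and the server applies a sample-count-weighted average of those differences. The weighting scheme (in the style of federated averaging), the dictionary layout of the weights, and all names and numbers here are assumptions for illustration; the platform's actual aggregation rule is not described in this summary.

```python
from typing import Dict, List, Tuple

Weights = Dict[str, List[float]]


def weight_delta(local: Weights, shared: Weights) -> Weights:
    """What a contributor submits: local weights minus shared weights."""
    return {
        name: [l - s for l, s in zip(local[name], shared[name])]
        for name in shared
    }


def aggregate(shared: Weights, updates: List[Tuple[Weights, int]]) -> Weights:
    """Apply the sample-weighted mean of the submitted deltas to the shared weights."""
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("updates must cover at least one sample")
    new_weights: Weights = {}
    for name, values in shared.items():
        mean_delta = [0.0] * len(values)
        for delta, n in updates:
            for i, d in enumerate(delta[name]):
                mean_delta[i] += d * n / total
        new_weights[name] = [v + m for v, m in zip(values, mean_delta)]
    return new_weights


# Illustrative weights only; no raw data leaves either site.
shared = {"w": [1.0, 2.0], "b": [0.0]}
site_a = {"w": [1.2, 2.0], "b": [0.1]}    # fine-tuned on 30 local samples
site_b = {"w": [0.8, 2.4], "b": [-0.1]}   # fine-tuned on 10 local samples
updates = [(weight_delta(site_a, shared), 30), (weight_delta(site_b, shared), 10)]
updated = aggregate(shared, updates)
print(updated)
```

Sharing weight deltas avoids exchanging raw records, although weight updates can still leak some information about the underlying data unless further protections are added.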
The system is deployed via a user-friendly Streamlit interface, offering interactive predictions, SHAP visualisations, and a simulation workflow for testing collaborative improvements. The model shows improved predictive accuracy and robustness, with up to a 15% reduction in RMSE when federated learning is used.
Key Deliverables
1. Dataset Preprocessing & Feature Engineering: Cleaned and normalised preclinical biodistribution data (percentage of injected dose, %ID, in tumour). Extracted structured features (e.g. size, charge, coating) and text embeddings of unstructured nanoparticle descriptions (SciBERT).
2. Model Development: Neural network and XGBoost ensemble for predicting %ID. Early stopping and hyperparameter tuning implemented for training stability.
3. Explainability: Integrated SHAP analysis for per-sample and global feature attribution. Visualised feature contributions to tumour delivery efficiency.
4. Web Platform Implementation: Streamlit-based interface for uploading, visualising, and testing nanoparticle designs. Real-time feedback with performance metrics and SHAP plots.
5. Collaborative Learning Framework: Weight-sharing mechanism for community retraining without raw data exchange, inspired by federated learning principles to preserve privacy.
6. Evaluation and Validation: Model evaluated using RMSE, R², and cross-validation. Demonstrated up to a 15% reduction in RMSE when federated learning is used.
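The evaluation metrics named above can be written out in a few lines. The sketch below defines RMSE, R², a simple k-fold index split, and the relative RMSE reduction used to express a percentage improvement. All numbers are made up to exercise the functions and are not results from this project; the interleaved fold assignment is one possible scheme, assumed here for brevity.

```python
import math
from typing import List, Sequence, Tuple


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def kfold_indices(n: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return (train, test) index lists for k interleaved folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    return [([j for j in range(n) if j not in fold], fold) for fold in folds]


def relative_reduction(baseline_rmse: float, new_rmse: float) -> float:
    """Fractional RMSE reduction relative to a baseline (0.15 means 15%)."""
    return (baseline_rmse - new_rmse) / baseline_rmse


# Made-up values, used only to exercise the functions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(f"RMSE = {rmse(y_true, y_pred):.3f}, R2 = {r2_score(y_true, y_pred):.2f}")
print(f"Relative reduction = {relative_reduction(2.0, 1.7):.0%}")
for train_idx, test_idx in kfold_indices(6, 3):
    print("train:", train_idx, "test:", test_idx)
```

A percentage improvement in RMSE is only meaningful relative to a stated baseline evaluated on the same held-out folds.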