Overview
Predicting house prices accurately in a dynamic market like London is difficult for traditional modelling techniques because the influencing factors are highly non-linear. This project investigates the extent to which machine learning and deep learning models improve short-term prediction accuracy for freehold houses in London.
By merging HM Land Registry Price Paid Data with Energy Performance Certificate (EPC) datasets, the study compensates for the sparse feature set of transaction records alone. The project follows a complete data-science lifecycle, from data acquisition to model deployment.
| Data Sources | Property Type | Time Period |
|---|---|---|
| HM Land Registry PPD, EPC Data (MHCLG) | London freehold houses | 2015–2024 |
Methodology
The project follows a rigorous data-science methodology:
- Data Acquisition: Downloaded and merged HM Land Registry Price Paid Data with Energy Performance Certificate datasets under the Open Government Licence v3.0.
- Data Cleaning: Handled missing values, removed outliers, filtered for London freehold properties only.
- Feature Engineering: Extracted date features (year, month, weekday, quarter), applied Label Encoding to high-cardinality categorical features (postcodes, street names).
- Data Split: Used 2015-2023 for training, reserved 2024 as hold-out set to test temporal generalization.
- Models Evaluated: Linear Regression, Random Forest, XGBoost, MLP (Neural Network), and LSTM (Deep Learning).
- Deployment: Built an interactive Gradio-based address lookup tool for end-user predictions.
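The acquisition step can be sketched with pandas. The column names below (`postcode`, `address`, `total_floor_area`, and so on) are simplified placeholders for illustration, not the exact PPD or EPC field names:

```python
import pandas as pd

# Toy stand-ins for the real PPD and EPC extracts.
ppd = pd.DataFrame({
    "postcode": ["E1 6AN", "SW1A 1AA"],
    "address": ["1 HIGH ST", "2 LOW RD"],
    "price": [450_000, 1_200_000],
    "date_of_transfer": ["2023-05-01", "2024-02-10"],
})
epc = pd.DataFrame({
    "postcode": ["E1 6AN", "SW1A 1AA"],
    "address": ["1 HIGH ST", "2 LOW RD"],
    "total_floor_area": [82.0, 145.0],
    "habitable_rooms": [3, 5],
    "energy_rating": ["C", "D"],
})

# Inner join on postcode + address keeps only transactions
# that have a matching certificate.
merged = ppd.merge(epc, on=["postcode", "address"], how="inner")
```

In practice address strings need normalisation (case, abbreviations) before the join, otherwise an inner merge silently drops valid transactions.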
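A minimal sketch of the feature-engineering step, assuming a `date_of_transfer` datetime column and using scikit-learn's `LabelEncoder` for a high-cardinality categorical such as postcode:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "date_of_transfer": pd.to_datetime(["2023-05-01", "2024-02-10", "2023-11-30"]),
    "postcode": ["E1 6AN", "SW1A 1AA", "E1 6AN"],
})

# Date-derived features, as listed in the methodology.
df["year"] = df["date_of_transfer"].dt.year
df["month"] = df["date_of_transfer"].dt.month
df["weekday"] = df["date_of_transfer"].dt.weekday
df["quarter"] = df["date_of_transfer"].dt.quarter

# Integer-encode the high-cardinality categorical; tree models can
# split on these codes, though the ordering they imply is arbitrary.
df["postcode_enc"] = LabelEncoder().fit_transform(df["postcode"])
```

Label encoding keeps the feature matrix compact for thousands of postcodes and street names, where one-hot encoding would explode the dimensionality.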
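The temporal split and hold-out evaluation can be illustrated on synthetic data; linear regression stands in here for the project's XGBoost model to keep the sketch dependency-light:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic sales whose price loosely depends on floor area and year.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "year": rng.integers(2015, 2025, n),
    "floor_area": rng.uniform(40, 200, n),
})
df["price"] = (3_000 * df["floor_area"]
               + 10_000 * (df["year"] - 2015)
               + rng.normal(0, 20_000, n))

# Temporal split: train on 2015-2023, hold out 2024 entirely,
# so evaluation measures forward-in-time generalisation.
train = df[df["year"] <= 2023]
holdout = df[df["year"] == 2024]

features = ["year", "floor_area"]
model = LinearRegression().fit(train[features], train["price"])
r2 = r2_score(holdout["price"], model.predict(holdout[features]))
```

A year-based split avoids the leakage that random shuffling would introduce, since a randomly split model could "see the future" of the market it is asked to predict.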
Results
Key findings from the model evaluation:
- XGBoost achieved an R² of 0.83 on the 2024 hold-out set, outperforming all other models.
- Tree-based ensembles (XGBoost, Random Forest) clearly outperformed the deep learning approaches (MLP, LSTM) on this tabular dataset.
- Merging EPC attributes (floor area, habitable rooms, energy rating) with transaction data substantially improved predictive power over transaction data alone.
| Model | Test R² | 2024 Hold-Out R² | Status |
|---|---|---|---|
| XGBoost Regressor | 0.82 | 0.83 | Best Model |
| Random Forest | 0.78 | 0.77 | Strong |
| MLP (Neural Network) | 0.75 | 0.74 | Moderate |
| LSTM (Deep Learning) | 0.63 | 0.64 | Weak |
| Linear Regression | 0.55 | 0.59 | Baseline |
Tech Stack

| Category | Technologies |
|---|---|
| Languages | Python |
| Data Processing | Pandas, NumPy |
| Machine Learning | scikit-learn, XGBoost, TensorFlow, Keras |
| Visualization | Matplotlib, Seaborn |
| Deployment | Gradio |
| Tools | Jupyter, VS Code, GitHub |
Links
Explore the project code, notebooks, and full report: