London House Price Prediction

Machine Learning vs Deep Learning for London Property Valuation

Predicting London house prices from HM Land Registry and EPC data, with XGBoost achieving an R2 of 0.83 on the 2024 hold-out set.

Overview

Predicting house prices accurately in dynamic markets like London poses significant challenges for traditional modelling techniques due to the non-linearity of influencing factors. This project investigates the extent to which machine learning and deep learning models improve short-term prediction accuracy for freehold houses in London.

By merging HM Land Registry Price Paid Data with Energy Performance Certificate (EPC) datasets, the study addresses the limited feature set of transaction records, which on their own lack property attributes such as floor area, number of habitable rooms, and energy rating. The project follows a complete data-science lifecycle from data acquisition to model deployment.
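As a rough illustration of the merge step, the sketch below joins sale records to EPC records on a normalised postcode and address key. The join key, the normalisation rules, the field names, and the sample rows are all assumptions made for this example; the project's actual matching logic is not described here. It uses only the Python standard library so it runs without the project's dependencies.

```python
import re


def address_key(postcode: str, address: str) -> tuple[str, str]:
    """Build a normalised (postcode, address) join key (assumed matching rule)."""
    clean_postcode = re.sub(r"\s+", "", postcode).upper()
    # Drop punctuation, then collapse repeated whitespace.
    clean_address = re.sub(r"[^A-Z0-9 ]", "", address.upper())
    clean_address = re.sub(r"\s+", " ", clean_address).strip()
    return clean_postcode, clean_address


def merge_sales_with_epc(sales: list[dict], epc: list[dict]) -> list[dict]:
    """Inner-join sale records with EPC records on the normalised key."""
    epc_by_key = {address_key(r["postcode"], r["address"]): r for r in epc}
    merged = []
    for sale in sales:
        match = epc_by_key.get(address_key(sale["postcode"], sale["address"]))
        if match is not None:
            merged.append({
                **sale,
                "floor_area": match["floor_area"],
                "habitable_rooms": match["habitable_rooms"],
                "energy_rating": match["energy_rating"],
            })
    return merged


# Made-up rows with hypothetical field names, for illustration only.
sales = [
    {"postcode": "zz1 1zz", "address": "10, Example Road", "price": 750000},
    {"postcode": "ZZ2 2ZZ", "address": "2 Sample Street", "price": 520000},
]
epc = [
    {"postcode": "ZZ11ZZ", "address": "10 EXAMPLE ROAD",
     "floor_area": 120.0, "habitable_rooms": 5, "energy_rating": "D"},
]

merged = merge_sales_with_epc(sales, epc)
```

Only the first sale finds an EPC match, so `merged` holds a single enriched row. An inner join like this drops unmatched sales; whether the project kept or discarded such rows is not stated.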

Data Sources

HM Land Registry PPD, EPC Data (MHCLG)

Property Type

London Freehold Houses

Time Period

2015 - 2024

Methodology

The project follows a rigorous data-science methodology:

  • Data Acquisition: Downloaded and merged HM Land Registry Price Paid Data with Energy Performance Certificate datasets under Open Government Licence v3.0.
  • Data Cleaning: Handled missing values, removed outliers, and filtered the data to London freehold properties only.
  • Feature Engineering: Extracted date features (year, month, weekday, quarter) and applied Label Encoding to high-cardinality categorical features (postcodes, street names).
  • Data Split: Used 2015-2023 for training and reserved 2024 as a hold-out set to test temporal generalization.
  • Models Evaluated: Linear Regression, Random Forest, XGBoost, MLP (Neural Network), and LSTM (Deep Learning).
  • Deployment: Built an interactive Gradio-based address lookup tool for end-user predictions.
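The feature-engineering and data-split steps above can be sketched as follows. This is a minimal standard-library illustration, not the project's code (which used Pandas and scikit-learn). The field names and sample rows are made up, and two choices are assumptions of this sketch: fitting the encoder on training rows only, and mapping categories unseen in training to -1.

```python
from datetime import date


def date_features(d: date) -> dict:
    """Year, month, weekday (Monday = 0) and quarter of a sale date."""
    return {
        "year": d.year,
        "month": d.month,
        "weekday": d.weekday(),
        "quarter": (d.month - 1) // 3 + 1,
    }


def fit_label_encoder(values) -> dict:
    """Map each distinct category to an integer code, in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


# Made-up rows with hypothetical field names, for illustration only.
rows = [
    {"sale_date": date(2016, 5, 13), "postcode": "ZZ2 2ZZ", "price": 480000},
    {"sale_date": date(2023, 11, 30), "postcode": "ZZ1 1ZZ", "price": 910000},
    {"sale_date": date(2024, 2, 9), "postcode": "ZZ3 3ZZ", "price": 650000},
]

for row in rows:
    row.update(date_features(row["sale_date"]))

# Temporal split: 2015-2023 for training, 2024 held out.
train = [r for r in rows if 2015 <= r["year"] <= 2023]
holdout = [r for r in rows if r["year"] == 2024]

# Assumption: fit on training rows only; unseen categories get code -1.
encoder = fit_label_encoder(r["postcode"] for r in train)
for r in rows:
    r["postcode_code"] = encoder.get(r["postcode"], -1)
```

Splitting by year rather than at random keeps every 2024 sale out of training, so the hold-out score reflects how the model handles a period it has never seen. Label encoding assigns arbitrary integers to categories, which tree-based models can split on but which imposes an artificial ordering for linear models and neural networks.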

Results

Key findings from the model evaluation:

  • XGBoost achieved R2 of 0.83 on the 2024 hold-out set, outperforming all other models.
  • Tree-based ensembles (XGBoost 0.83, Random Forest 0.77) outperformed deep learning approaches (MLP 0.74, LSTM 0.64) on the 2024 hold-out set for this tabular dataset, with the widest gap between XGBoost and the LSTM.
  • Merging EPC data (floor area, habitable rooms, energy rating) with transaction data significantly improved predictive power over transaction data alone.

Model                | Test R2 | 2024 Hold-Out R2 | Status
XGBoost Regressor    | 0.82    | 0.83             | Best Model
Random Forest        | 0.78    | 0.77             | Strong
MLP (Neural Network) | 0.75    | 0.74             | Moderate
LSTM (Deep Learning) | 0.63    | 0.64             | Weak
Linear Regression    | 0.55    | 0.59             | Baseline

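For reference, the R2 values in the table are coefficients of determination: one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on made-up prices. It is an illustration of the metric only; scikit-learn ships an equivalent `sklearn.metrics.r2_score`, though which implementation the project used is not stated.

```python
def r2_score(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up prices (in GBP thousands), for illustration only.
actual = [400.0, 500.0, 600.0, 700.0]
predicted = [420.0, 480.0, 630.0, 690.0]

score = r2_score(actual, predicted)

# A model that always predicts the mean price scores exactly 0.
baseline = r2_score(actual, [550.0] * len(actual))
```

Read this way, an R2 of 0.83 means the model's squared error on the hold-out set is 17% of the squared error of always predicting the hold-out mean price.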
Business Value: The model can help real estate professionals and buyers understand price drivers, improve valuation accuracy, and make data-driven decisions in the London property market.

Tech Stack

Languages

Python

Data Processing

Pandas, NumPy

Machine Learning

scikit-learn, XGBoost, TensorFlow, Keras

Visualization

Matplotlib, Seaborn

Deployment

Gradio

Tools

Jupyter, VS Code, GitHub