Kaggle Housing Prices project — ML pipeline with preprocessing, feature engineering, and regression models
This project is based on the Kaggle "House Prices: Advanced Regression Techniques" dataset.
The goal is to predict house sale prices using machine learning models and feature engineering.
- Source: Kaggle House Prices Challenge
- Includes features such as lot size, year built, number of rooms, and neighborhood.
Data Preprocessing
- Missing value imputation
- Feature scaling and encoding
- Handling multicollinearity
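The imputation and scaling steps above can be sketched as a small scikit-learn pipeline. This is a minimal illustration, not the project's actual code; the column names are placeholders standing in for the Kaggle features.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the Kaggle data; column names are illustrative.
df = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 11250.0],
    "GrLivArea": [1710.0, 1262.0, 1786.0, np.nan],
})

# Median imputation followed by standard scaling for numeric columns.
numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

X = numeric_pipeline.fit_transform(df)
print(X.shape)  # (4, 2) — no NaNs remain
```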
Feature Engineering
- Created new features (e.g., age of house, total square footage)
- Combined One-Hot Encoding and Ordinal Encoding inside a ColumnTransformer / Pipeline, since nominal and ordinal categorical features require different transformations
- Selected important predictors
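The mixed encoding strategy above can be sketched with a ColumnTransformer: nominal features get one-hot encoded, ordinal features get an explicit quality ranking. The column names and category ordering here are illustrative assumptions, not the project's exact configuration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Illustrative columns: Neighborhood is nominal, ExterQual is ordinal.
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "CollgCr", "NAmes", "OldTown"],
    "ExterQual": ["TA", "Gd", "Ex", "TA"],
})

preprocess = ColumnTransformer([
    # Nominal: one column per category, unseen categories ignored at predict time.
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Neighborhood"]),
    # Ordinal: encode the assumed quality scale Po < Fa < TA < Gd < Ex.
    ("ordinal", OrdinalEncoder(categories=[["Po", "Fa", "TA", "Gd", "Ex"]]),
     ["ExterQual"]),
])

X = preprocess.fit_transform(df)
print(X.shape)  # 3 one-hot columns + 1 ordinal column -> (4, 4)
```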
Modeling
- Tried Linear Regression, Random Forest, and XGBoost
- Evaluated using RMSE
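A minimal version of the train / evaluate loop might look like the following, using synthetic data in place of the housing features (XGBoost is omitted here to keep the sketch dependency-free):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the housing features.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit each candidate model and report test RMSE.
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(type(model).__name__, round(rmse, 2))
```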
Results
Best model: The RandomForestRegressor achieved a test RMSE of 0.14 on the log(SalePrice) scale, which translates to an average prediction error of approximately 15% in the original sale price units. For example, for a property valued at $200,000, the model's prediction would typically be within ±$30,000 of the actual price. The R² score of 0.87 indicates that the model explains 87% of the variability in housing prices, demonstrating strong predictive performance.
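The conversion from log-scale RMSE to a percentage error follows from exponentiating: an error of 0.14 in log units is a multiplicative factor of exp(0.14) ≈ 1.15, i.e. roughly ±15%.

```python
import math

# RMSE on log(SalePrice) -> typical multiplicative error in price units.
rmse_log = 0.14
pct_error = math.exp(rmse_log) - 1
print(round(pct_error, 3))  # ≈ 0.15, i.e. about 15%
```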
Insights: Feature importance shows neighborhood and square footage as key drivers.
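Importance rankings like the one behind this insight can be read off a fitted forest's `feature_importances_` attribute. A minimal sketch on synthetic data (real feature names like Neighborhood and GrLivArea would replace the indices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the housing data.
X, y = make_regression(n_samples=200, n_features=5, n_informative=2,
                       random_state=0)
rf = RandomForestRegressor(random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort to rank the key drivers.
order = np.argsort(rf.feature_importances_)[::-1]
print(order)
```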
- Clone the repo:
  git clone https://github.com/Adibab/Housing-Price-Prediction.git