A house price prediction study that uses exploratory data analysis, missing-value handling, and linear regression with Lasso and Ridge regularization to predict house prices in the Ames Housing data set.
Case study: Feature engineering for Ames house price prediction

1 Problem statement

In this case study, you will prepare the Ames Housing data set, provided as a CSV file, so that it is suitable for a machine learning (ML) algorithm. You will do this by first exploring the data and then applying feature transformations to it. You will then train linear regression, ridge regression, and lasso regression models to predict house prices.

2 Steps

  • 2.1 Load data set
  • 2.2 Exploratory Data Analysis (EDA)
    1. Histograms
    2. Heatmap
    3. Scatterplots
    4. Scatter matrix
    5. Correlation between other features and 'SalePrice'

The target 'SalePrice' variable is highly correlated with features such as OverallQual, GrLivArea, GarageCars, GarageArea and TotalBsmtSF among others.
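
The correlation ranking described above can be sketched as follows. This is a minimal illustration, not the study's own code: it builds a small synthetic frame that only borrows a few Ames column names ('OverallQual', 'GrLivArea', 'MoSold'), and the numbers used to generate it are made up.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Ames data (NOT the real data set).
# Column names are borrowed from Ames; all values are made up.
rng = np.random.default_rng(0)
n = 200
overall_qual = rng.integers(1, 11, size=n)
df = pd.DataFrame({
    "OverallQual": overall_qual,
    "GrLivArea": 800 + 150 * overall_qual + rng.normal(0, 200, size=n),
    "MoSold": rng.integers(1, 13, size=n),  # unrelated to price in this toy data
})
df["SalePrice"] = (
    20000 * df["OverallQual"]
    + 50 * df["GrLivArea"]
    + rng.normal(0, 15000, size=n)
)

# Pearson correlation of each numeric feature with the target, strongest first.
corr_with_target = (
    df.select_dtypes("number")
    .corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

On the real data, the same ranking would be computed on the loaded DataFrame instead of the synthetic one.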

  • 2.3 Process dataset for ML


    1. Handle missing values
       • Fill nulls in 'LotFrontage' with the median value calculated after grouping by 'Neighborhood'
       • Fill nulls in 'GarageYrBlt' and 'MasVnrArea' with 0
    2. Apply a log transform to the target 'SalePrice'
    3. One-hot encode the categorical features
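
The processing steps above could look roughly like this in pandas. The six-row frame is a hand-made stand-in with illustrative values, and the natural logarithm is an assumption, since the text does not say which log variant was used.

```python
import numpy as np
import pandas as pd

# Tiny hand-made frame standing in for the Ames data (illustrative values only).
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "CollgCr", "CollgCr", "CollgCr"],
    "LotFrontage": [60.0, np.nan, 80.0, 70.0, 90.0, np.nan],
    "GarageYrBlt": [1960.0, np.nan, 1975.0, 2003.0, np.nan, 2005.0],
    "MasVnrArea": [0.0, 120.0, np.nan, 200.0, 0.0, np.nan],
    "SalePrice": [150000, 140000, 165000, 210000, 230000, 220000],
})

# 'LotFrontage': fill nulls with the median of the same neighborhood.
neighborhood_median = df.groupby("Neighborhood")["LotFrontage"].transform("median")
df["LotFrontage"] = df["LotFrontage"].fillna(neighborhood_median)

# 'GarageYrBlt' and 'MasVnrArea': fill nulls with 0.
zero_fill_cols = ["GarageYrBlt", "MasVnrArea"]
df[zero_fill_cols] = df[zero_fill_cols].fillna(0)

# Log-transform the target (natural log assumed here).
df["SalePrice"] = np.log(df["SalePrice"])

# One-hot encode the remaining categorical columns.
df = pd.get_dummies(df)
print(df.head())
```

Filling by neighborhood rather than with a global median keeps the imputed lot frontage close to that of nearby houses.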

3 Train Linear Regression

Split the dataset into a training set (X_train, y_train) and a test set (X_test, y_test), then fit a linear regression model on the training set.
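
A minimal sketch of the split and fit with scikit-learn. The data here is synthetic, and `test_size=0.2` and `random_state=42` are assumptions; the split ratio actually used in the study is not stated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic features and log-scale target standing in for the processed Ames data.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = 12.0 + X @ np.array([0.3, 0.2, 0.1, 0.0, 0.0]) + rng.normal(0, 0.1, size=300)

# Hold out 20% of the rows for testing (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Ordinary least squares fit on the training rows only.
model = LinearRegression()
model.fit(X_train, y_train)
print(model.coef_, model.intercept_)
```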

4 Evaluate Linear Regression model

R^2 score on training set: 0.94609, MSE on training set: 0.00808

R^2 score on test set: 0.89136, MSE on test set: 0.01472
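
Scores of this kind can be computed with `r2_score` and `mean_squared_error` from scikit-learn. The sketch below uses synthetic data, so its output will not match the numbers reported above; the helper name `evaluate` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the processed Ames features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 12.0 + X @ np.array([0.3, 0.2, 0.1, 0.0, 0.0]) + rng.normal(0, 0.1, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)


def evaluate(model, X, y):
    """Return (R^2, MSE) of the model's predictions on X against y."""
    pred = model.predict(X)
    return r2_score(y, pred), mean_squared_error(y, pred)


r2_train, mse_train = evaluate(model, X_train, y_train)
r2_test, mse_test = evaluate(model, X_test, y_test)
print(f"R^2 score on training set: {r2_train:.5f}, MSE on training set: {mse_train:.5f}")
print(f"R^2 score on test set: {r2_test:.5f}, MSE on test set: {mse_test:.5f}")
```

Comparing the training and test scores side by side is what reveals overfitting: a training R^2 well above the test R^2 suggests the model is fitting noise.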


5 Model refinement with Ridge regression and Lasso regression

Ridge regression (alpha=0.05): R^2 score on training set: 0.94598, R^2 score on test set: 0.89410

Lasso regression (alpha=0.0001): R^2 score on training set: 0.94169, R^2 score on test set: 0.90843
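
A sketch of fitting the regularized models with the alpha values quoted above, next to plain linear regression for comparison. The data is synthetic, so the printed scores are only illustrative and will differ from the reported ones.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the processed Ames features.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))
true_coef = np.zeros(20)
true_coef[:4] = [0.3, -0.2, 0.15, 0.1]
y = 12.0 + X @ true_coef + rng.normal(0, 0.1, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

models = {
    "Linear regression": LinearRegression(),
    "Ridge regression (alpha=0.05)": Ridge(alpha=0.05),
    "Lasso regression (alpha=0.0001)": Lasso(alpha=0.0001),
}

# .score() returns R^2 for scikit-learn regressors.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train R^2 = {scores[name][0]:.5f}, test R^2 = {scores[name][1]:.5f}")
```

In practice the alpha values would usually be chosen by cross-validation rather than fixed by hand.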

6 Conclusion

6.1 In practice, ridge regression is usually the first choice between these two models.

6.2 However, if you have a large number of features and expect only a few of them to be important, lasso might be a better choice.
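
The reason behind point 6.2 is that the L1 penalty in lasso can set coefficients exactly to zero, while the L2 penalty in ridge only shrinks them. A small sketch on synthetic data where only 3 of 50 features matter; the alpha values here are illustrative and are not the ones used in the study.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 50 features, only the first 3 influence the target.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 50
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + rng.normal(0, 0.1, size=n_samples)

# Illustrative regularization strengths (assumed, not from the study).
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Count coefficients that are not exactly zero.
n_ridge = int(np.sum(ridge.coef_ != 0))
n_lasso = int(np.sum(lasso.coef_ != 0))
print(f"Non-zero coefficients: ridge = {n_ridge}, lasso = {n_lasso}")
```

Lasso therefore doubles as a feature selector, which makes the fitted model easier to interpret when most features are irrelevant.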

R^2 score      Linear Regression   Ridge Regression   Lasso Regression
training set   0.94609             0.94598            0.94169
test set       0.89136             0.89410            0.90843
