
Linear Regression on Boston Housing Dataset

In my previous blog, I covered the basics of linear regression and gradient descent. To get hands-on with linear regression, we will take a real dataset and apply the concepts that we have learned.

We will take the Housing dataset, which contains information about different houses in Boston. This data was originally part of the UCI Machine Learning Repository and has since been removed from it; we can still access it through the scikit-learn library. There are 506 samples and 13 feature variables in this dataset. The objective is to predict the prices of houses using the given features.

So let’s get started.

First, we will import the required libraries.
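A minimal set of imports for this walkthrough might look like the following (the aliases np, pd, plt, and sns are the usual conventions, not fixed by the original post):

```python
# Core libraries used throughout this walkthrough.
import numpy as np               # numerical arrays and np.c_
import pandas as pd              # dataframes
import matplotlib.pyplot as plt  # plotting
import seaborn as sns            # statistical visualizations
```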

Next, we will load the housing data from the scikit-learn library and understand it.
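A sketch of the loading step; it assumes a scikit-learn version that still ships load_boston (the loader was deprecated in scikit-learn 1.0 and removed in 1.2):

```python
from sklearn.datasets import load_boston

# Load the dataset as a Bunch object (a dict-like container).
boston_dataset = load_boston()
```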

We print the keys of boston_dataset to understand what it contains. print(boston_dataset.keys()) gives

dict_keys(['data', 'target', 'feature_names', 'DESCR'])
  • data: contains the information for various houses
  • target: prices of the house
  • feature_names: names of the features
  • DESCR: describes the dataset

To know more about the features, use boston_dataset.DESCR. The description of all the features is given below:

  • CRIM: Per capita crime rate by town
  • ZN: Proportion of residential land zoned for lots over 25,000 sq. ft.
  • INDUS: Proportion of non-retail business acres per town
  • CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • NOX: Nitric oxide concentration (parts per 10 million)
  • RM: Average number of rooms per dwelling
  • AGE: Proportion of owner-occupied units built prior to 1940
  • DIS: Weighted distances to five Boston employment centers
  • RAD: Index of accessibility to radial highways
  • TAX: Full-value property tax rate per $10,000
  • PTRATIO: Pupil-teacher ratio by town
  • B: 1000(Bk — 0.63)², where Bk is the proportion of [people of African American descent] by town
  • LSTAT: Percentage of lower status of the population
  • MEDV: Median value of owner-occupied homes in $1000s

The price of a house, indicated by the variable MEDV, is our target variable; the remaining variables are the features based on which we will predict the value of a house.

We will now load the data into a pandas dataframe using pd.DataFrame. We then print the first 5 rows of the data using head().
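A sketch of this step; the dataframe name boston is an assumption:

```python
# Build a dataframe from the raw feature matrix, using the feature names as columns.
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)

# Inspect the first 5 rows.
boston.head()
```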

We can see that the target value MEDV is missing from the data. We create a new column of target values and add it to the dataframe.
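Adding the target could look like this, continuing with the hypothetical boston dataframe:

```python
# Append the target prices as a new MEDV column.
boston['MEDV'] = boston_dataset.target
```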

Data preprocessing

After loading the data, it’s a good practice to check whether there are any missing values. We count the number of missing values for each feature using isnull().

However, there are no missing values in this dataset as shown below.
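A sketch of the check (the all-zero counts confirm there is nothing to impute):

```python
# Count missing values per column; every count comes out 0 for this dataset.
boston.isnull().sum()
```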

Exploratory Data Analysis

Exploratory Data Analysis is a very important step before training the model. In this section, we will use some visualizations to understand the relationship of the target variable with other features.

Let’s first plot the distribution of the target variable MEDV. We will use the distplot function from the seaborn library.
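A sketch of the plot; note that distplot has since been deprecated in seaborn in favour of histplot, so an older seaborn version is assumed here:

```python
# Plot the distribution of the target variable with 30 bins.
sns.set(rc={'figure.figsize': (11.7, 8.27)})
sns.distplot(boston['MEDV'], bins=30)
plt.show()
```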

We see that the values of MEDV are approximately normally distributed, with a few outliers.

Next, we create a correlation matrix that measures the linear relationships between the variables. The correlation matrix can be computed using the corr method of the pandas dataframe. We will use the heatmap function from the seaborn library to plot it.
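A sketch of this step:

```python
# Pairwise correlations, rounded for readability, drawn as an annotated heatmap.
correlation_matrix = boston.corr().round(2)
sns.heatmap(data=correlation_matrix, annot=True)
plt.show()
```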

The correlation coefficient ranges from -1 to 1. If the value is close to 1, it means that there is a strong positive correlation between the two variables. When it is close to -1, the variables have a strong negative correlation.

Observations:

  • To fit a linear regression model, we select those features which have a high correlation with our target variable MEDV. By looking at the correlation matrix we can see that RM has a strong positive correlation with MEDV (0.7), whereas LSTAT has a strong negative correlation with MEDV (-0.74).
  • An important point in selecting features for a linear regression model is to check for multicollinearity. The features RAD and TAX have a correlation of 0.91, so they are strongly correlated with each other, and we should not select both of them together for training the model. Check this for an explanation. The same goes for the features DIS and AGE, which have a correlation of -0.75.

Based on the above observations, we will select RM and LSTAT as our features. Using a scatter plot, let’s see how these features vary with MEDV.
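A sketch of the scatter plots, one panel per feature:

```python
plt.figure(figsize=(20, 5))

features = ['LSTAT', 'RM']
target = boston['MEDV']

# One scatter panel per selected feature, all against MEDV.
for i, col in enumerate(features):
    plt.subplot(1, len(features), i + 1)
    plt.scatter(boston[col], target)
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('MEDV')

plt.show()
```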

Observations:

  • The prices increase linearly as the value of RM increases. There are a few outliers, and the data seems to be capped at 50.
  • The prices tend to decrease with an increase in LSTAT, though the relationship does not appear to be perfectly linear.

Preparing the data for training the model

We concatenate the LSTAT and RM columns using np.c_ provided by the numpy library.
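A sketch, with X and Y as assumed names for the feature matrix and target vector:

```python
# Stack the two selected columns side by side into a (506, 2) feature matrix.
X = np.c_[boston['LSTAT'], boston['RM']]
Y = boston['MEDV']
```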

Splitting the data into training and testing sets

Next, we split the data into training and testing sets. We train the model with 80% of the samples and test with the remaining 20%. We do this to assess the model’s performance on unseen data. To split the data, we use the train_test_split function provided by the scikit-learn library. We finally print the sizes of our training and test sets to verify that the split has occurred properly.
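A sketch of the split; the random_state value is an arbitrary choice for reproducibility:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the samples for testing.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=5)

print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
```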

(404, 2)
(102, 2)
(404,)
(102,)

Training and testing the model

We use scikit-learn’s LinearRegression to train our model on the training set and then generate predictions for both the training and test sets.
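A sketch of the fit; the name lin_model is an assumption:

```python
from sklearn.linear_model import LinearRegression

# Fit an ordinary least squares model on the training split.
lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)
```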

Model evaluation

We will evaluate our model using RMSE and R2-score.
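A sketch of the evaluation, reusing the hypothetical lin_model from above:

```python
from sklearn.metrics import mean_squared_error, r2_score

# Evaluate on the training split.
y_train_predict = lin_model.predict(X_train)
rmse = np.sqrt(mean_squared_error(Y_train, y_train_predict))
r2 = r2_score(Y_train, y_train_predict)

print("The model performance for training set")
print("--------------------------------------")
print("RMSE is {}".format(rmse))
print("R2 score is {}".format(r2))

# Evaluate on the held-out test split.
y_test_predict = lin_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(Y_test, y_test_predict))
r2 = r2_score(Y_test, y_test_predict)

print("The model performance for testing set")
print("--------------------------------------")
print("RMSE is {}".format(rmse))
print("R2 score is {}".format(r2))
```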

The model performance for training set
--------------------------------------
RMSE is 5.6371293350711955
R2 score is 0.6300745149331701

The model performance for testing set
--------------------------------------
RMSE is 5.137400784702911
R2 score is 0.6628996975186952

This is a good baseline to start with. In the upcoming blogs, we will look at ways to improve the model’s performance.

The complete Jupyter Notebook can be found here.

Conclusion

In this story, we applied the concepts of linear regression to the Boston housing dataset. I would recommend trying out other datasets as well.

Here are a few places you can look for data

Thanks for Reading!!

In the next part of this series, we will cover Polynomial Regression. Do watch this space for more.