In this document, we experiment with linear regression and generalized linear models in sklearn. We will predict real estate prices from a number of covariates. The dataset is available here.

Exploratory Analysis

As usual, we will start with some exploratory analysis.

The variable X1 transaction date is stored as a decimal number whose integer part is the year. Let us keep only the year and ignore the rest. We will also treat the transaction year as a factor rather than as a numeric variable. Roughly 70% of the houses were sold in 2013, and the rest were sold in 2012.
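
The conversion can be sketched in pandas as follows. The values in this small frame are made up for illustration and are not taken from the dataset; only the column name comes from the text above.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; the real column comes from the dataset.
df = pd.DataFrame(
    {"X1 transaction date": [2012.917, 2013.250, 2013.500, 2012.667, 2013.083]}
)

# Keep the integer part (the year) and store it as a categorical variable (a factor).
year = np.floor(df["X1 transaction date"]).astype(int)
df["X1 transaction date"] = pd.Categorical(year)

# Share of transactions in each year.
year_share = df["X1 transaction date"].value_counts(normalize=True)
print(year_share)
```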

We can group by X1 transaction date and calculate summary statistics of the other variables. In particular, the highest house price per unit area occurs in 2013. Nothing else is noticeable in the other variables.
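
A minimal sketch of such a grouped summary is given below. The numbers are invented for illustration, and the choice of mean, minimum and maximum is an assumption about which statistics are of interest.

```python
import pandas as pd

# Invented numbers for illustration only; column names follow the dataset.
df = pd.DataFrame(
    {
        "X1 transaction date": [2012, 2012, 2013, 2013, 2013],
        "X2 house age": [32.0, 19.5, 13.3, 5.0, 7.1],
        "Y house price of unit area": [37.9, 42.2, 47.3, 54.8, 43.1],
    }
)

# One row per year; columns form a (variable, statistic) MultiIndex.
summary = df.groupby("X1 transaction date").agg(["mean", "min", "max"])
print(summary)
```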

We also plot the histogram of house price for each transaction year. House prices appear to be higher in 2013.

The next variable to consider is X2 house age. We plot its marginal distribution and its joint distribution with the response. Nothing in particular stands out in this plot.

As above, the same figures can be plotted for X3 distance to the nearest MRT station and X4 number of convenience stores. The former is negatively correlated with the house price, and the latter is positively correlated with it.

Finally, we look at X5 latitude and X6 longitude together. Plotting the house price directly on a map does not give a readable picture, so we opt to bin the latitude and longitude instead. Real estate prices appear to be higher in the southeast region.
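
One way to do the binning is sketched below on synthetic data. The coordinate ranges, the number of bins and the use of the mean price per cell are all assumptions made for illustration; the random prices carry no spatial pattern.

```python
import numpy as np
import pandas as pd

# Synthetic coordinates and prices for illustration only (arbitrary ranges).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame(
    {
        "X5 latitude": rng.uniform(24.93, 25.02, n),
        "X6 longitude": rng.uniform(121.47, 121.57, n),
        "Y house price of unit area": rng.gamma(shape=5.0, scale=8.0, size=n),
    }
)

# Cut each coordinate into equal-width bins (the bin count is an arbitrary choice).
n_bins = 5
df["lat_bin"] = pd.cut(df["X5 latitude"], bins=n_bins)
df["lon_bin"] = pd.cut(df["X6 longitude"], bins=n_bins)

# Mean price per (latitude bin, longitude bin) cell.
grid = df.pivot_table(
    index="lat_bin",
    columns="lon_bin",
    values="Y house price of unit area",
    aggfunc="mean",
    observed=False,
)

# Put the northernmost bin on the first row, as on a map.
grid = grid.sort_index(ascending=False)
print(grid.round(1))
```

The resulting table can be passed to a heatmap function to obtain a coarse price map.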

Simple Linear Regression

Now we fit a plain least-squares linear regression of Y house price of unit area on all the other variables. In terms of preprocessing, we only need to convert X1 transaction date into dummy variables. We also scale the continuous variables to the range [0, 1] with sklearn's MinMaxScaler.
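
The pipeline described above could look roughly like the sketch below. The data are synthetic, the 80/20 split and the random seeds are assumptions, and only a subset of the covariates is used, so the resulting scores are unrelated to those of the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the dataset (illustration only).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame(
    {
        "X1 transaction date": rng.choice([2012, 2013], size=n, p=[0.3, 0.7]),
        "X2 house age": rng.uniform(0, 40, n),
        "X3 distance to the nearest MRT station": rng.uniform(20, 6000, n),
        "X4 number of convenience stores": rng.integers(0, 11, n).astype(float),
    }
)
y = (
    45
    - 0.2 * df["X2 house age"]
    - 0.004 * df["X3 distance to the nearest MRT station"]
    + 1.0 * df["X4 number of convenience stores"]
    + rng.normal(0, 5, n)
)

# Dummy-encode the transaction year; drop one level to avoid collinearity.
X = pd.get_dummies(df, columns=["X1 transaction date"], drop_first=True, dtype=float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the scaler on the training set only, then apply it to both sets.
continuous = [
    "X2 house age",
    "X3 distance to the nearest MRT station",
    "X4 number of convenience stores",
]
scaler = MinMaxScaler().fit(X_train[continuous])
X_train[continuous] = scaler.transform(X_train[continuous])
X_test[continuous] = scaler.transform(X_test[continuous])

lm = LinearRegression().fit(X_train, y_train)
print("train R^2:", lm.score(X_train, y_train))
print("test R^2:", lm.score(X_test, y_test))
print(dict(zip(X_train.columns, lm.coef_)))
```

Fitting the scaler on the training set alone keeps information from the test set out of the preprocessing step.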

The linear model has an R2 of only 0.5215 on the training set and 0.5122 on the test set, which is not ideal. The fitted coefficients are also shown below.

To visualize the fitted model, let us compare the fitted values with the actual house prices.

Gamma GLM

Since the house price can only be positive, the linear model above is not really adequate, as it allows negative predictions. A better choice is the Gamma Generalized Linear Model (GLM) with a log link, which is also implemented in sklearn. With a log link, the predicted mean is the exponential of a linear predictor and is therefore always positive.
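
A minimal sketch with sklearn's GammaRegressor, which uses a log link, is shown below. The data are simulated from a Gamma model with made-up coefficients, and setting alpha=0.0 (no L2 penalty) is an assumption; the default is alpha=1.0.

```python
import numpy as np
from sklearn.linear_model import GammaRegressor

# Simulated data for illustration only: log(mean) is linear in two features.
rng = np.random.default_rng(0)
n = 500
X = rng.uniform(0, 1, size=(n, 2))
mu = np.exp(3.0 + 0.8 * X[:, 0] - 1.2 * X[:, 1])

# Gamma response with mean mu and shape parameter 10.
shape = 10.0
y = rng.gamma(shape=shape, scale=mu / shape)

# alpha=0.0 switches off the L2 penalty (an assumption, not the default).
glm = GammaRegressor(alpha=0.0, max_iter=1000).fit(X, y)
pred = glm.predict(X)

print("intercept:", glm.intercept_)
print("coefficients:", glm.coef_)
print("smallest prediction:", pred.min())
```

Because predictions are obtained by exponentiating the linear predictor, they cannot be negative, unlike those of the linear model.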

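The score returned by the GLM object is D2, the fraction of deviance explained: one minus the ratio of the model deviance to the null deviance, where the null deviance is that of an intercept-only model.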
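
To make the definition concrete, the sketch below recomputes the score by hand from the mean Gamma deviance and compares it with the value returned by score. The data are simulated and the coefficients are made up.

```python
import numpy as np
from sklearn.linear_model import GammaRegressor
from sklearn.metrics import mean_gamma_deviance

# Simulated positive response for illustration only.
rng = np.random.default_rng(1)
n = 400
X = rng.uniform(0, 1, size=(n, 2))
mu = np.exp(2.0 + 1.0 * X[:, 0] - 0.5 * X[:, 1])
y = rng.gamma(shape=5.0, scale=mu / 5.0)

glm = GammaRegressor().fit(X, y)

# Deviance of the fitted model.
dev_model = mean_gamma_deviance(y, glm.predict(X))

# Null deviance: an intercept-only model predicts the sample mean everywhere.
dev_null = mean_gamma_deviance(y, np.full_like(y, y.mean()))

d2_manual = 1.0 - dev_model / dev_null
d2_sklearn = glm.score(X, y)
print(d2_manual, d2_sklearn)
```

As with R2 for the linear model, a value closer to 1 indicates that the covariates explain more of the deviance.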

Again, we can make a similar visualization of the values fitted by the GLM. Compared with the linear model, the GLM seems to over-fit the tail.