In this document, I will continue analyzing the toy dataset (available here). The objective is to fit a logistic regression model and a kNN classifier.

Revisiting the Dataset & Preprocessing

Recall that our dataset has one invalid entry whose Income is negative; this entry will be excluded.

Now, we would like to treat Illness as the response and all other variables as the covariates. Below we conduct some basic data preprocessing: the invalid entry is removed, and dummy variables are created for all levels of the categorical covariates City and Gender. We then partition the whole dataset with a 75-25 train-test split.
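A minimal sketch of these steps follows; the file name toy_dataset.csv and the exact column names are assumptions about how the data were loaded earlier in this analysis.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("toy_dataset.csv")
df = df[df["Income"] >= 0]  # drop the invalid entry with negative Income

# One-hot encode the categorical covariates, keeping ALL levels
X = pd.get_dummies(df.drop(columns=["Illness"]), columns=["City", "Gender"])
y = (df["Illness"] == "Yes").astype(int)

# 75-25 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```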

Notice that the min_max_scaler should be fitted on the training set only, in order to prevent data leakage from the testing set. We can verify this from the summary statistics of X_train and X_test: every column of the former spans exactly [0, 1] after scaling, whereas the minima and maxima of the latter are not necessarily 0 and 1.
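A minimal sketch with sklearn's MinMaxScaler, continuing from the split above:

```python
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()

# Fit on the training set only, then apply the same transform to both sets
X_train = pd.DataFrame(
    min_max_scaler.fit_transform(X_train),
    columns=X_train.columns, index=X_train.index,
)
X_test = pd.DataFrame(
    min_max_scaler.transform(X_test),
    columns=X_test.columns, index=X_test.index,
)

# Sanity check: X_train columns span exactly [0, 1]; X_test need not
print(X_train.describe().loc[["min", "max"]])
print(X_test.describe().loc[["min", "max"]])
```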

Logistic Regression

A simple model for classifying Illness is logistic regression, which is readily available in the sklearn package. Since we have created dummy variables for all levels of City and Gender (rather than dropping one level per variable), the dummy columns already span a constant, so there is no need to include an intercept in our model.
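A minimal sketch of the fit (note that sklearn's LogisticRegression applies L2 regularization by default, which we leave as-is here):

```python
from sklearn.linear_model import LogisticRegression

# Fit without an intercept, since the full sets of City and Gender
# dummies already span a constant column
logit = LogisticRegression(fit_intercept=False, max_iter=1000)
logit.fit(X_train, y_train)
```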

The fitted coefficients can be inspected as follows.
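For example, by pairing the coefficient vector with the column names of the design matrix:

```python
import pandas as pd

# Coefficients for the positive (Illness == Yes) class
coef_table = pd.Series(logit.coef_[0], index=X_train.columns)
print(coef_table.sort_values())
```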

Now, we can also predict the probabilities of Illness on the testing set. Predicting the exact outcome of Illness is less informative, since the prediction will be No in most cases given that the average probability of illness is only about 8%.
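For instance:

```python
# Predicted probability of Illness == Yes for each test observation
proba_test = logit.predict_proba(X_test)[:, 1]
print(f"mean predicted probability: {proba_test.mean():.3f}")  # close to the ~8% base rate
```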

One way to visualize the fitted logistic regression is through a partial dependence plot. Let us first plot the probability of Illness against Age, while averaging out the effects of all other covariates.
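One way to compute such a curve by hand is sketched below: sweep Age over a grid while holding the other covariates at their observed values, then average the predicted probabilities. (sklearn's inspection module also provides PartialDependenceDisplay for this purpose.)

```python
import numpy as np
import matplotlib.pyplot as plt

# Recall Age has been min-max scaled to [0, 1]
age_grid = np.linspace(0, 1, 50)
partial_dep = []
for a in age_grid:
    X_tmp = X_train.copy()
    X_tmp["Age"] = a  # hold all other covariates at their observed values
    partial_dep.append(logit.predict_proba(X_tmp)[:, 1].mean())

plt.plot(age_grid, partial_dep)
plt.xlabel("Age (min-max scaled)")
plt.ylabel("Partial dependence of P(Illness)")
plt.show()
```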

We observe that the probability of Illness increases with Age, and the logistic model provides a reasonably good fit on the training set.

A similar plot can be produced for the relationship between Illness and Income. The trend is decreasing overall, but the fit is not ideal in the tail (i.e. for individuals with large Income). This is not surprising: recall that the histogram of Income shows the majority of individuals having incomes around its mean and median, so the model fits that region best at the expense of the tail.

For the discrete covariates, we can also make similar plots. Let us first look at City. It seems that the probabilities of Illness are fitted quite well.
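For a discrete covariate, the comparison reduces to the observed versus fitted Illness probability within each level. A sketch, assuming the dummy columns from pd.get_dummies carry the City_ prefix:

```python
# Observed vs fitted Illness probability for each City level
city_cols = [c for c in X_train.columns if c.startswith("City_")]
proba_train = logit.predict_proba(X_train)[:, 1]
for c in city_cols:
    mask = X_train[c] == 1
    print(f"{c}: observed {y_train[mask].mean():.3f}, "
          f"fitted {proba_train[mask].mean():.3f}")
```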

A similar plot can be produced for Gender. The Illness probability for females is severely under-fitted, which is probably due to the relative sizes of the male (56%) and female (44%) groups in the dataset.

kNN Classification

Next, we move to a non-probabilistic approach, k-Nearest Neighbours (kNN), for the classification of Illness. The kNN model is also readily available in the sklearn package. Let us try the standard setting with k=5.
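A minimal sketch, reusing the scaled training set from before (the min-max scaling matters here, since kNN relies on Euclidean distances between observations):

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
```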

We can take a look at the fitted model on the training and testing sets. It seems that the proportion of Illness on the training set is severely underfitted when k=5. On the testing set, the accuracy is around 0.9163.
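One way to produce these numbers:

```python
# Compare the predicted proportion of Illness with the observed proportion,
# then compute the test accuracy
print(f"train predicted rate: {knn.predict(X_train).mean():.4f}, "
      f"observed rate: {y_train.mean():.4f}")
print(f"test accuracy: {knn.score(X_test, y_test):.4f}")
```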

To conduct model selection, i.e. to select the optimal number of neighbours k, we will try different values of k and pick the one that maximizes the testing score. For simplicity, we skip cross-validation in this document, although it would be more robust.
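A sketch of this search, over the arbitrary range k = 1 to 20:

```python
# Record the test accuracy for each candidate k
scores = {}
for k in range(1, 21):
    knn_k = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = knn_k.score(X_test, y_test)

best_k = max(scores, key=scores.get)
print(f"best k: {best_k}, test score: {scores[best_k]:.4f}")
```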

We see that the test score increases with k and levels off after k=6, so k=6 may be considered a reasonable choice for the number of neighbours under this particular train-test split.