In this document, I conduct a preliminary analysis of a toy dataset (illness) available on Kaggle (here). The goal is to prepare for a data scientist interview that involves on-the-spot data analysis in Python.

Data Loading and First Look

The first step is to load the data and do some basic cleaning. There are 150,000 observations. We treat Illness as the response variable and relate it to the covariates City, Gender, Age and Income.
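As a rough sketch of this step, the snippet below reads a CSV with pandas and drops incomplete rows. The four rows are made up so that the example runs on its own; with the real file you would pass its path to `pd.read_csv` instead. Treating "basic cleaning" as "drop rows with missing values" is an assumption here, not something the dataset requires.

```python
import io

import pandas as pd

# Made-up stand-in for the Kaggle CSV (hypothetical values, same column names).
csv_text = """City,Gender,Age,Income,Illness
Dallas,Male,41,40367.0,No
Dallas,Female,54,45084.0,No
Boston,Female,29,,No
Austin,Male,36,87303.0,Yes
"""

raw = pd.read_csv(io.StringIO(csv_text))

# Basic cleaning (assumed): report and drop rows with missing values.
print(raw.isna().sum())
df = raw.dropna().reset_index(drop=True)

print(df.shape)
print(df.dtypes)
```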

We take a brief look at the summary statistics of the continuous variables. A negative Income is clearly invalid, and since we have no additional information to correct it, we exclude such data points (there is actually only one).
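One way to do this, sketched on a tiny made-up frame, is to call `describe()` and then filter with a boolean mask. The values below are hypothetical and only serve to show one negative Income being dropped.

```python
import pandas as pd

# Hypothetical values; the second row carries an invalid negative Income.
df = pd.DataFrame({
    "Age": [25, 40, 65, 33],
    "Income": [52000.0, -654.0, 91000.0, 78000.0],
})

# Summary statistics (count, mean, std, min, quartiles, max) per numeric column.
summary = df.describe()
print(summary)

# Count the invalid rows before removing them, so the exclusion is documented.
n_negative = int((df["Income"] < 0).sum())
df_clean = df[df["Income"] >= 0].reset_index(drop=True)
print(f"dropped {n_negative} row(s) with negative Income")
```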

We can plot the histograms of the continuous variables directly using method chaining. Age appears to be roughly uniformly distributed between 25 and 65. The distribution of Income is trimodal and lives on a very different scale from Age. It might be useful to transform the data (in this case, by taking the log of Income) so that the covariates have roughly the same scale.
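A minimal sketch of the chained histogram call follows. The data are randomly generated placeholders, so the shapes will not match the real dataset; `LogIncome` is a column name introduced here for illustration, and the `Agg` backend is chosen only so the snippet runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)

import numpy as np
import pandas as pd

# Randomly generated placeholder data, not the Kaggle values.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.integers(25, 66, size=1000),
    "Income": rng.uniform(20000, 150000, size=1000),
})

# Add the log-transformed column and plot all three histograms in one chain.
transformed = df.assign(LogIncome=lambda d: np.log(d["Income"]))
axes = transformed.hist(bins=30, figsize=(9, 3), layout=(1, 3))

# After the transform, LogIncome is on a scale comparable to Age.
print(transformed[["Age", "Income", "LogIncome"]].describe().loc[["min", "max"]])
```

Note that the log is only defined for strictly positive Income, which is one more reason to remove invalid values before transforming.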

Next, we turn to the discrete variables. We first convert them to categorical variables (factors) and check their unique values. We also plot their marginal distributions.
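In pandas, the closest analogue to a factor is the `category` dtype. The sketch below converts the discrete columns, lists their levels, and computes the marginal proportions; the rows are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "City": ["Dallas", "Boston", "Dallas", "Austin", "Boston", "Dallas"],
    "Gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "Illness": ["No", "No", "Yes", "No", "No", "No"],
})

cat_cols = ["City", "Gender", "Illness"]
df = df.astype({col: "category" for col in cat_cols})

marginals = {}
for col in cat_cols:
    # Levels of the categorical and the share of rows in each level.
    print(col, list(df[col].cat.categories))
    marginals[col] = df[col].value_counts(normalize=True)
    print(marginals[col])
```

Each entry of `marginals` is a Series, so calling `.plot.bar()` on it would draw the corresponding marginal distribution.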

More Plotting

Since we treat Illness as the response, it is also informative to plot it against the covariates. Essentially, we are plotting a discrete variable against continuous and discrete variables, which is easily done using the plotting utilities in Python. We also tabulate Illness against the discrete covariates City and Gender for more insight.
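One possible way to draw these plots, assuming matplotlib and pandas plotting, is a box plot for a continuous covariate and a stacked bar chart of row proportions for a discrete one. The data are randomly generated placeholders, and the choice of plot types is an illustration rather than the only reasonable option.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated placeholder data, not the Kaggle values.
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "Age": rng.integers(25, 66, size=n),
    "Gender": rng.choice(["Female", "Male"], size=n),
    "Illness": rng.choice(["No", "Yes"], size=n, p=[0.9, 0.1]),
})

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(9, 3))

# Discrete response vs. continuous covariate: Age distribution per Illness level.
df.boxplot(column="Age", by="Illness", ax=ax_left)

# Discrete response vs. discrete covariate: share of Illness within each Gender.
shares = pd.crosstab(df["Gender"], df["Illness"], normalize="index")
shares.plot.bar(stacked=True, ax=ax_right)
```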

Next, we tabulate the response Illness against the discrete covariates.
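A contingency table of this kind can be built with `pd.crosstab`. The sketch below uses made-up rows; `margins=True` adds row and column totals, and `normalize="index"` turns counts into within-row proportions.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "City": ["Dallas", "Boston", "Dallas", "Austin", "Boston", "Dallas"],
    "Gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "Illness": ["No", "No", "Yes", "No", "Yes", "No"],
})

# Counts of Illness by Gender, with totals in the "All" row and column.
by_gender = pd.crosstab(df["Gender"], df["Illness"], margins=True)
print(by_gender)

# Proportion of Illness within each City (each row sums to 1).
by_city = pd.crosstab(df["City"], df["Illness"], normalize="index")
print(by_city)
```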

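To test whether Illness differs significantly by Gender, we can conduct a simple chi-squared test of independence on the contingency table above. The p-value is 0.61, well above conventional significance levels such as 0.05, so we fail to reject the null hypothesis that Illness is independent of Gender; there is no evidence of a significant difference.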
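As a sketch, the test can be run with `scipy.stats.chi2_contingency`, assuming SciPy is available. The counts below are hypothetical and are not the table from the dataset, so the resulting p-value will differ from the one quoted above.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table of counts (Gender by Illness).
table = pd.DataFrame(
    [[6000, 500], [7000, 600]],
    index=["Female", "Male"],
    columns=["No", "Yes"],
)

# For 2x2 tables, SciPy applies Yates' continuity correction by default.
chi2, p_value, dof, expected = chi2_contingency(table)

print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.3f}")
if p_value < 0.05:
    print("Reject independence at the 5% level.")
else:
    print("Fail to reject independence at the 5% level.")
```

The same call works for larger tables, for example City by Illness, in which case the degrees of freedom are (rows - 1) x (columns - 1) and no continuity correction is applied.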

The same analysis applies to the pair Illness and City. The p-value again suggests that the relationship between Illness and City is not statistically significant.