In this document, I conduct a preliminary analysis of a toy dataset (illness) available on Kaggle. The objective is to prepare for a data scientist interview session that involves on-the-spot data analysis in Python.

In [33]:

```
# The usual setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
```

The first step is to load the data and do some basic cleaning. There are 150,000 observations. We will treat `Illness` as the response variable, which is related to the covariates `City`, `Gender`, `Age`, and `Income`.

In [34]:

```
# Read dataset
df = pd.read_csv("toy_dataset.csv", index_col = 0)
```

In [35]:

```
df.shape
```

Out[35]:

(150000, 5)

In [36]:

```
df.head()
```

Out[36]:

Number | City | Gender | Age | Income | Illness
---|---|---|---|---|---
1 | Dallas | Male | 41 | 40367.0 | No
2 | Dallas | Male | 54 | 45084.0 | No
3 | Dallas | Male | 42 | 52483.0 | No
4 | Dallas | Male | 40 | 40941.0 | No
5 | Dallas | Male | 46 | 50289.0 | No

We take a brief look at the summary statistics of the continuous variables. A negative `Income` is clearly invalid, and without any additional information we simply exclude these data points (in fact only one).

In [37]:

```
# Describe continuous variables
df.describe()
```

Out[37]:

| | Age | Income |
|---|---|---|
| count | 150000.000000 | 150000.000000 |
| mean | 44.950200 | 91252.798273 |
| std | 11.572486 | 24989.500948 |
| min | 25.000000 | -654.000000 |
| 25% | 35.000000 | 80867.750000 |
| 50% | 45.000000 | 93655.000000 |
| 75% | 55.000000 | 104519.000000 |
| max | 65.000000 | 177157.000000 |

In [38]:

```
# Inspect the invalid data: one row with negative income
df[df['Income'] < 0]
# Remove the invalid row
df = df[df['Income'] >= 0]
```

In [39]:

```
df.shape
```

Out[39]:

(149999, 5)

We can directly plot histograms of the continuous variables using method chaining. `Age` appears to be roughly uniformly distributed between 25 and 65. The distribution of `Income` is trimodal and lives on a different scale than `Age`. It might be useful to apply a data transformation (in this case, taking `log`) so the covariates have roughly the same scale.

In [40]:

```
# Plot Age
df['Age'].hist(bins = 100)
plt.title("Histogram of Age")
```

Out[40]:

Text(0.5, 1.0, 'Histogram of Age')

In [41]:

```
# Plot Income
df['Income'].hist(bins = 100)
plt.title("Histogram of Income")
```

Out[41]:

Text(0.5, 1.0, 'Histogram of Income')

In [42]:

```
# Transform Income
df['Income'] = np.log(df['Income'])
df['Income'].describe()
# Plotting
df['Income'].hist(bins = 100)
plt.title("Histogram of log(Income)")
```

Out[42]:

Text(0.5, 1.0, 'Histogram of log(Income)')

Next, we turn to the discrete variables. We first turn them into factors and check their unique values. The marginal distributions are also plotted.

In [43]:

```
# Convert discrete columns into categorical factors
for x in ['City', 'Gender', 'Illness']:
    df[x] = df[x].astype('category')
    print(df[x].unique())
```

In [44]:

```
# Plot City
sns.countplot(data = df, x = 'City')
plt.title('Bar Plot of City')
plt.xticks(rotation = 45)
```

Out[44]:

(array([0, 1, 2, 3, 4, 5, 6, 7]), [Text(0, 0, 'Austin'), Text(1, 0, 'Boston'), Text(2, 0, 'Dallas'), Text(3, 0, 'Los Angeles'), Text(4, 0, 'Mountain View'), Text(5, 0, 'New York City'), Text(6, 0, 'San Diego'), Text(7, 0, 'Washington D.C.')])

In [45]:

```
# Plot Gender
sns.countplot(data = df, x = 'Gender')
plt.title('Bar Plot of Gender')
```

Out[45]:

Text(0.5, 1.0, 'Bar Plot of Gender')

In [46]:

```
# Plot Illness
sns.countplot(data = df, x = 'Illness')
plt.title('Bar Plot of Illness')
```

Out[46]:

Text(0.5, 1.0, 'Bar Plot of Illness')

Since we will treat `Illness` as the response, it may also be informative to plot it against the covariates. Essentially, we are plotting a discrete variable against continuous and discrete variables, which is easily done using the plotting utilities in Python. In addition, we tabulate `Illness` against the discrete covariates `City` and `Gender` for more insights.

In [47]:

```
# Summary statistics of Illness
df['Illness'].value_counts() / df.shape[0]
```

Out[47]:

No     0.919079
Yes    0.080921
Name: Illness, dtype: float64
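As a side note, dividing by `df.shape[0]` is not necessary: `value_counts` can return proportions directly via its `normalize` flag. A quick sketch on a synthetic series:

```python
import pandas as pd

# Synthetic stand-in for the Illness column
s = pd.Series(['No'] * 92 + ['Yes'] * 8)
print(s.value_counts(normalize=True))  # proportions instead of raw counts
```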

In [48]:

```
# Illness vs. Age
sns.histplot(df, x = 'Age', hue = 'Illness', bins = 100)
plt.title("Histogram of Age by Illness status")
```

Out[48]:

Text(0.5, 1.0, 'Histogram of Age by Illness status')

In [49]:

```
sns.violinplot(data = df, x = 'Illness', y = 'Age')
plt.title("Violin Plot of Age by Illness status")
```

Out[49]:

Text(0.5, 1.0, 'Violin Plot of Age by Illness status')

In [50]:

```
# Illness vs. log(Income)
sns.histplot(df, x = 'Income', hue = 'Illness', bins = 100)
plt.title("Histogram of log(Income) by Illness status")
```

Out[50]:

Text(0.5, 1.0, 'Histogram of log(Income) by Illness status')

In [51]:

```
sns.violinplot(data = df, x = 'Illness', y = 'Income')
plt.title("Violin Plot of log(Income) by Illness status")
```

Out[51]:

Text(0.5, 1.0, 'Violin Plot of log(Income) by Illness status')

Next, we will tabulate the response `Illness` against the discrete covariates.

In [52]:

```
# Illness vs. Gender
pd.crosstab(df['Gender'], df['Illness'], normalize = True, margins = True)
```

Out[52]:

Gender \ Illness | No | Yes | All
---|---|---|---
Female | 0.405796 | 0.035534 | 0.44133
Male | 0.513283 | 0.045387 | 0.55867
All | 0.919079 | 0.080921 | 1.00000

To test whether `Illness` differs significantly by `Gender`, we can conduct a simple chi-squared test on the contingency table above. The p-value is about 0.62, which indicates there is not a significant difference.

In [53]:

```
# A chi-squared test
stats.chi2_contingency(pd.crosstab(df['Gender'], df['Illness']))
```

Out[53]:

(0.25259927807912536,
 0.6152507634973752,
 1,
 array([[60842.14120761,  5356.85879239],
        [77018.85879239,  6781.14120761]]))
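To see where this statistic comes from, the expected counts E_ij = (row total × column total) / N can be computed by hand. Below is a sketch on a hypothetical 2×2 table of counts (not the exact counts from the data). Note that `chi2_contingency` applies Yates' continuity correction to 2×2 tables by default, so we disable it to match the manual sum.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table of counts: rows = Gender, columns = Illness
obs = np.array([[60869.0, 5330.0],
                [76992.0, 6808.0]])

n = obs.sum()
# Expected counts under independence: E_ij = row_i * col_j / N
expected = obs.sum(axis=1, keepdims=True) * obs.sum(axis=0, keepdims=True) / n
chi2_manual = ((obs - expected) ** 2 / expected).sum()

# scipy agrees once Yates' continuity correction is turned off
chi2, p, dof, exp = stats.chi2_contingency(obs, correction=False)
print(chi2_manual, chi2, p)
```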

The same analysis goes for the pair `Illness` and `City`. The p-value again suggests that the relationship between `Illness` and `City` is insignificant.

In [54]:

```
# Illness vs. City
pd.crosstab(df['City'], df['Illness'], normalize = True, margins = True)
```

Out[54]:

City \ Illness | No | Yes | All
---|---|---|---
Austin | 0.075207 | 0.006740 | 0.081947
Boston | 0.050767 | 0.004573 | 0.055340
Dallas | 0.120627 | 0.010747 | 0.131374
Los Angeles | 0.197368 | 0.017120 | 0.214488
Mountain View | 0.086941 | 0.007853 | 0.094794
New York City | 0.308575 | 0.026807 | 0.335382
San Diego | 0.029914 | 0.002627 | 0.032540
Washington D.C. | 0.049680 | 0.004453 | 0.054134
All | 0.919079 | 0.080921 | 1.000000

In [55]:

```
stats.chi2_contingency(pd.crosstab(df['City'], df['Illness']))
```

Out[55]:

(2.927681297686849,
 0.8916107023609688,
 7,
 array([[11297.32472883,   994.67527117],
        [ 7629.27860186,   671.72139814],
        [18111.3798492 ,  1594.6201508 ],
        [29569.54348362,  2603.45651638],
        [13068.39084927,  1150.60915073],
        [46236.13042087,  4070.86957913],
        [ 4486.02684685,   394.97315315],
        [ 7462.9252195 ,   657.0747805 ]]))