Spark's Work

@formula Macro in StatsModels.jl for Model Matrices

A noticeable shortcoming of the LRMoE.jl package, as of version 0.3.2, is the lack of support for a formula interface.

For example, when fitting a (G)LM in R, one can simply call the following without worrying about the data types of the columns. This is particularly handy when x2 is categorical (e.g. with levels A, B and C), since R expands it into dummy variables automatically.

lm_r = lm(y ~ x1 + x2, data = df)

The same works with GLM.jl in Julia (see here).

lm_julia = GLM.lm(@formula(y ~ x1 + x2), df)

Whenever I wanted to fit an LRMoE on the same df, I used to manually code up a feature engineering function to convert the categorical variable x2 into dummy variables (assuming A as the reference level for x2).

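using DataFrames

function feature_engineering(df)
    # start from an intercept column, then add the covariates
    features = DataFrame(intercept = fill(1, nrow(df)))
    features.x1 = df.x1
    # dummy variables for x2, with "A" as the reference level
    features.x2_B = Int.(df.x2 .== "B")
    features.x2_C = Int.(df.x2 .== "C")
    return features
end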

This can be quite tedious and error-prone. Fortunately, the StatsModels.jl package provides a @formula macro that can be used to specify the model matrix. It works well in combination with the CategoricalArrays.jl and DataFrames.jl packages.

using DataFrames, CategoricalArrays

# If x2 is already stored as a CategoricalArray, a plain copy is enough
# df_copy = copy(df)
# Otherwise, build a copy with x2 converted, so the original df is left untouched
df_copy = df[:, [:x1]]
df_copy.x2 = CategoricalArray(df.x2; levels=["A", "B", "C"], ordered=true)
df_copy.y = df.y
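
The order of the levels matters here, because under the default dummy coding the first level is treated as the reference. The following is a minimal sketch on a hypothetical toy vector (x2_raw and x2_cat are names made up for illustration) showing how to inspect the level order and, if needed, change the reference level with levels!.

```julia
using CategoricalArrays

# Hypothetical toy column, for illustration only
x2_raw = ["B", "A", "C", "A"]

# Listing the levels explicitly puts "A" first, so "A" is the reference level
x2_cat = CategoricalArray(x2_raw; levels=["A", "B", "C"])
original_order = collect(levels(x2_cat))

# Reorder the levels in place so that "B" becomes the reference level instead
levels!(x2_cat, ["B", "A", "C"])
ref_level = first(levels(x2_cat))
```

Reordering the levels does not change the stored values, only which level is dropped when the dummy variables are generated.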

Next, the @formula macro can be called to generate the model matrix (see here for more details).

using StatsModels

# set up a formula (explicit intercept: apply_schema without a model type adds none)
fml = @formula(y ~ 1 + x1 + x2)
df_fml_schema = StatsModels.apply_schema(
    fml,
    StatsModels.schema(fml, df_copy)
)
# get y and X
y, X = StatsModels.modelcols(df_fml_schema, df_copy)
# convert y to a matrix, which is needed for LRMoE
y = reshape(y, length(y), 1)
# keep track of the column names
y_col, X_col = StatsModels.coefnames(df_fml_schema)
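
As a sanity check, the formula-based model matrix can be compared against hand-coded dummy variables. The sketch below uses a small hypothetical data frame (df_toy and all values in it are made up for illustration) and writes the intercept explicitly in the formula, so that the result does not depend on whether apply_schema adds one.

```julia
using DataFrames, CategoricalArrays, StatsModels

# Hypothetical toy data, for illustration only
df_toy = DataFrame(
    y  = [1.0, 2.5, 0.7, 3.1],
    x1 = [0.1, 0.4, 0.2, 0.9],
    x2 = CategoricalArray(["A", "B", "C", "B"]; levels=["A", "B", "C"]),
)

# Build the response vector and model matrix from the formula
fml_toy = @formula(y ~ 1 + x1 + x2)
fml_applied = apply_schema(fml_toy, schema(fml_toy, df_toy))
y_toy, X_toy = modelcols(fml_applied, df_toy)
_, X_names = coefnames(fml_applied)

# Hand-coded equivalent: intercept, x1, and dummies for "B" and "C"
# ("A" is the reference level and gets no column)
X_manual = hcat(
    ones(nrow(df_toy)),
    df_toy.x1,
    df_toy.x2 .== "B",
    df_toy.x2 .== "C",
)

matches_manual = X_toy == X_manual
```

If the two matrices agree, the column names in X_names can be used to label the fitted coefficients.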

Now, the y and X matrices can be directly used to fit an LRMoE model.