@formula Macro in StatsModels.jl for Model Matrices

A noticeable shortcoming of the LRMoE.jl package, as of version 0.3.2, is the lack of support for a formula interface.
For example, when fitting a (G)LM in R, one can simply call the following without worrying about the data types of the columns. This is particularly handy when x2 is categorical (e.g. with levels A, B and C).
lm_r = lm(y ~ x1 + x2, data = df)
The same is true of GLM.jl in Julia (see here).
lm_julia = GLM.lm(@formula(y ~ x1 + x2), df)
Whenever I wanted to fit an LRMoE on the same df, I used to manually code up a feature engineering function to convert the categorical variable x2 into dummy variables (assuming A as the reference level for x2).
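using DataFrames

function feature_engineering(df)
    # start from an intercept column of ones (the column name is arbitrary)
    features = DataFrame(intercept = fill(1, nrow(df)))
    features.x1 = df.x1
    # dummy variables for x2, read from df, with "A" as the reference level
    features.x2_B = Int.(df.x2 .== "B")
    features.x2_C = Int.(df.x2 .== "C")
    return features
end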
This can be quite tedious and error-prone. Fortunately, the StatsModels.jl package provides a @formula macro that can be used to specify the model matrix. It is especially convenient when combined with the CategoricalArrays.jl and DataFrames.jl packages.
# If x2 is already stored as a CategoricalArray, a plain copy is enough
# df_copy = copy(df)
# Otherwise, build a new DataFrame and convert x2, leaving the original df untouched
df_copy = df[:, [:x1]]
df_copy.x2 = CategoricalArray(df.x2; levels=["A", "B", "C"], ordered=true)
df_copy.y = df.y
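The order of the levels matters because, under the default dummy coding in StatsModels.jl, the first level serves as the reference. A minimal sketch on a made-up vector (the name x2_toy and its values are hypothetical, for illustration only) shows how that order could be inspected and changed:

```julia
using CategoricalArrays

# Hypothetical toy vector, not taken from the original df
x2_toy = CategoricalArray(["B", "A", "C", "A"]; levels=["A", "B", "C"])

# The first level is the one that dummy coding treats as the reference
reference_before = first(levels(x2_toy))   # "A"

# Reorder the levels so that "C" comes first; the stored values are unchanged
levels!(x2_toy, ["C", "A", "B"])
reference_after = first(levels(x2_toy))    # "C"
```

Setting the levels explicitly, as done for df_copy.x2, avoids relying on the default sorted order.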
Next, the @formula macro can be called to generate the model matrix (see here for more details).
# set up a formula
fml = @formula(y ~ x1 + x2)
df_fml_schema = StatsModels.apply_schema(
    fml,
    StatsModels.schema(fml, df_copy)
)
# get y and X
y, X = StatsModels.modelcols(df_fml_schema, df_copy)
# convert y to a matrix, which is needed for LRMoE
y = reshape(y, length(y), 1)
# keep track of the column names
y_col, X_col = StatsModels.coefnames(df_fml_schema)
Now, the y and X matrices can be directly used to fit an LRMoE model.
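As a quick sanity check, the whole pipeline can be run on a small made-up data frame. The sketch below uses hypothetical names (df_toy, fml_toy and so on) and invented values. It writes the intercept explicitly as 1 +, on the assumption that an intercept column is wanted in X, as in the hand-coded version. It does not call LRMoE.jl itself.

```julia
using DataFrames, CategoricalArrays, StatsModels

# Hypothetical toy data, for illustration only
df_toy = DataFrame(
    y  = [1.0, 2.0, 3.0, 4.0],
    x1 = [0.5, 1.5, 2.5, 3.5],
    x2 = CategoricalArray(["A", "B", "C", "A"]; levels=["A", "B", "C"]),
)

# Intercept written explicitly (an assumption; drop `1 +` if it is not wanted)
fml_toy = @formula(y ~ 1 + x1 + x2)
fml_applied = apply_schema(fml_toy, schema(fml_toy, df_toy))

# Response vector and model matrix
y_toy, X_toy = modelcols(fml_applied, df_toy)
y_mat = reshape(y_toy, length(y_toy), 1)   # 4×1 matrix

# Column names of the response and of the model matrix
y_name, X_names = coefnames(fml_applied)

size(X_toy)   # (4, 4): intercept, x1, and two dummy columns for x2
X_names       # ["(Intercept)", "x1", "x2: B", "x2: C"]
```

Comparing X_names against the columns of a hand-coded feature matrix is a cheap way to confirm that both approaches use the same reference level.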