StatsModels.jl for Model Matrices
A noticeable shortcoming of the LRMoE.jl package, as of version 0.3.2, is the lack of support for a
For example, when fitting a (G)LM in R, one can simply call the following without worrying about the data types of the columns. This is particularly handy when x2 is categorical (e.g. with levels A, B and C):
lm_r = lm(y ~ x1 + x2, data = df)
The same is true of GLM.jl in Julia (see here).
lm_julia = GLM.lm(@formula(y ~ x1 + x2), df)
Whenever I wanted to fit an LRMoE on the same df, I used to hand-code a feature engineering function to convert the categorical variable x2 into dummy variables (assuming A as the reference level for x2):
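function feature_engineering(df)
    # start from an intercept column of ones (nrow comes from DataFrames.jl)
    features = DataFrame(intercept = fill(1, nrow(df)))
    features.x1 = df.x1
    # dummy-code x2 from the original df, with "A" as the reference level
    features.x2_B = Int.(df.x2 .== "B")
    features.x2_C = Int.(df.x2 .== "C")
    return features
end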
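To make the encoding concrete, here is a minimal sketch in base Julia of the matrix such a function is meant to produce. The four observations below are made up for illustration, and the names x1_toy, x2_toy and X_manual are hypothetical.

```julia
# Hypothetical toy columns, made up for illustration only
x1_toy = [0.5, 1.2, -0.3, 2.0]
x2_toy = ["A", "B", "C", "B"]

# Intercept, x1, and one 0/1 indicator per non-reference level of x2.
# Level "A" has no column of its own: it is the reference level.
X_manual = hcat(
    ones(length(x1_toy)),
    x1_toy,
    x2_toy .== "B",
    x2_toy .== "C",
)
```

Each additional level needs another hand-written indicator column, and the reference level is fixed only by which column is left out.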
This can be quite tedious and error-prone. Fortunately, the StatsModels.jl package provides a @formula macro that can be used to specify the model matrix. This is particularly handy when the categorical columns are stored as CategoricalArrays:
# If x2 is already stored as a CategoricalArray, a plain copy is enough
# df_copy = copy(df)
# Otherwise, build a new DataFrame so that the original df is left unchanged
df_copy = df[:, [:x1]]
df_copy.x2 = CategoricalArray(df.x2; levels=["A", "B", "C"], ordered=true)
df_copy.y = df.y
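The order passed to levels matters, because under the default dummy coding in StatsModels.jl the first level serves as the reference. A small sketch, assuming CategoricalArrays.jl is installed and using a made-up raw column (raw_x2 and cat_x2 are hypothetical names):

```julia
using CategoricalArrays

# Hypothetical raw column in which "B" happens to appear first
raw_x2 = ["B", "A", "C", "B"]

# Declaring the levels explicitly fixes their order regardless of the data
cat_x2 = CategoricalArray(raw_x2; levels=["A", "B", "C"])

declared = levels(cat_x2)      # level order as declared above
codes = levelcode.(cat_x2)     # integer position of each value within the levels
```

Listing a different level first, for example levels=["B", "A", "C"], would be one way to change the reference level.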
The @formula macro can then be called to generate the model matrix (see here for more details).
# set up a formula
fml = StatsModels.@formula(y ~ x1 + x2)
df_fml_schema = StatsModels.apply_schema(
    fml, StatsModels.schema(fml, df_copy)
)
# get y and X
y, X = StatsModels.modelcols(df_fml_schema, df_copy)
# convert y to a matrix, which is needed for LRMoE
y = reshape(y, length(y), 1)
# keep track of the column names
y_col, X_col = StatsModels.coefnames(df_fml_schema)
The resulting y and X matrices can be directly used to fit an LRMoE model.
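As a quick sanity check, the whole pipeline can be run on a tiny data set. This is only a sketch: the data and the names df_toy, fml_toy, y_toy and X_toy are made up for illustration, and DataFrames.jl, CategoricalArrays.jl and StatsModels.jl are assumed to be installed. The intercept is written explicitly as 1 in the formula here, so that the result does not depend on whether an intercept is added implicitly when the schema is applied outside of a model-fitting call.

```julia
using DataFrames, CategoricalArrays, StatsModels

# Hypothetical toy data, made up for illustration only
df_toy = DataFrame(
    y  = [1.0, 2.5, 0.7, 3.1],
    x1 = [0.5, 1.2, -0.3, 2.0],
    x2 = CategoricalArray(["A", "B", "C", "B"]; levels=["A", "B", "C"]),
)

# Explicit intercept, then x1 and the categorical x2
fml_toy = @formula(y ~ 1 + x1 + x2)
fml_applied = apply_schema(fml_toy, schema(fml_toy, df_toy))

# Response vector and model matrix
y_toy, X_toy = modelcols(fml_applied, df_toy)
y_toy = reshape(y_toy, length(y_toy), 1)   # n-by-1 matrix

# Names of the response and of each column of X_toy
y_name, X_names = coefnames(fml_applied)
```

Here X_toy should have one row per observation and four columns (the intercept, x1, and one dummy each for levels B and C), and X_names records which column is which.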