GH-16524 GLM - control variables - Regression, Binomial #16601

maurever · 2025-04-09T08:25:31Z

Add a parameter to specify control variables and to remove the effects of these variables for prediction and calculation of model metrics.

When fitting the GLM, control variables are also fitted in the model, like a regular predictor.
After the model is fitted, the control variables' effects are removed when predicting with the model and calculating metrics.
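
A minimal sketch of what "removing the effects" could look like at scoring time, assuming it means dropping the control variables' terms from the linear predictor. This is an illustration only, not H2O's scoring code: the class name, the method names, and the coefficient layout (intercept stored last) are all hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration; not H2O's implementation. */
final class ControlVariableScoringSketch {

    /**
     * Linear predictor for a single row.
     * Assumed layout: coefficients[j] pairs with row[j]; the last entry is the intercept.
     * When includeControl is false, terms at the control indices are skipped.
     */
    static double linearPredictor(double[] coefficients, double[] row,
                                  int[] controlIdxs, boolean includeControl) {
        Set<Integer> control = new HashSet<>();
        for (int idx : controlIdxs) {
            control.add(idx);
        }
        double eta = coefficients[coefficients.length - 1]; // start from the intercept
        for (int j = 0; j < row.length; j++) {
            if (!includeControl && control.contains(j)) {
                continue; // drop the control variable's contribution
            }
            eta += coefficients[j] * row[j];
        }
        return eta;
    }

    /** Inverse logit link for the binomial case, identity link otherwise. */
    static double predict(double eta, boolean binomial) {
        return binomial ? 1.0 / (1.0 + Math.exp(-eta)) : eta;
    }
}
```

The coefficients are the same in both cases because the control variables take part in the fit; only the terms summed at prediction time differ.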

Requirements from customer:

1. During model training, which metrics should be used for optimization purposes (like early stopping or lambda search)?

In addition, we would prefer the same to be applied for the offset. In other words, during model training, for the metrics used for optimization purposes (like early stopping or lambda search), we would prefer the metrics to be calculated with both control variable effects and offset effects included.

2. For variable importance calculations, would you prefer:
a) To include control variables in importance rankings but mark them as "control"
b) To exclude control variables from importance rankings entirely

We would prefer to include control variables in importance rankings, but mark them as "control".

3. When displaying model metrics in output summaries, what would you prefer?

We would prefer the same to be applied for the offset. In other words, we would prefer two sets of metrics to be displayed: (1) with both control effects and offset included, and (2) with both control effects and offset excluded.

TODO:

Add new parameter control variables
Implement scoring with/without control variables for regression distributions and the binomial distribution
Calculate scoring metrics with/without control variables (early stopping metrics only)
Edit scoring history table (add new metrics)
Variable importance - mark control variable (for example variable_control)
~~Implement scoring with/without control variables for the multinomial distribution~~ (will be implemented in a separate PR)

Tests:

Validation of the new parameter in Java
Test functionality with basic data in Java
Basic test that control variables work in Python
Basic test that control variables work in R
Scoring and prediction with/without control variables
Check scoring metrics with/without control variables
Generation of the scoring history table
Variable importance

Other work (to be implemented in separate PRs):

Grid search
Lambda search
Interactions

h2o-algos/src/test/java/hex/glm/GLMControlVariablesTest.java

+
+    private Frame scoreManualWithCoefficients(Double[] coefficients, Frame data, String frameName, int[] controlVariablesIdxs, boolean binomial){
+        Vec predictions = Vec.makeZero(data.numRows(), Vec.T_NUM);
+        for (int i = 0; i < data.numRows(); i++) {


To fix the issue, the type of the loop variable i should be changed from int to long. This ensures that the loop variable is at least as wide as the type of data.numRows(), preventing overflow and ensuring the loop condition is evaluated correctly. The change is localized to the loop declaration and does not affect the rest of the code's functionality.

Steps to implement the fix:

Update the type of the loop variable i from int to long in the for loop on line 636.

Ensure that all references to i within the loop remain compatible with the long type.
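
A small sketch of why the counter width matters, assuming `data.numRows()` returns a `long` as the explanation above implies. In Java the comparison `i < rows` widens an `int` counter to `long`, but `i++` still wraps at `Integer.MAX_VALUE`, so the condition can never become false for a larger row count. The class and method names are hypothetical.

```java
/** Hypothetical illustration of int vs long loop counters. */
final class LoopCounterWidthSketch {

    /** Incrementing an int wraps around silently at Integer.MAX_VALUE. */
    static int incrementInt(int i) {
        i++;
        return i;
    }

    /** A long counter keeps counting past the int range. */
    static long incrementLong(long i) {
        i++;
        return i;
    }

    /**
     * Whether `for (int i = 0; i < rowCount; i++)` can terminate.
     * The largest value an int counter reaches is Integer.MAX_VALUE,
     * so the loop exits only if rowCount does not exceed it.
     */
    static boolean intLoopTerminates(long rowCount) {
        return rowCount <= Integer.MAX_VALUE;
    }
}
```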

tomasfryda · 2025-06-23T10:56:53Z

h2o-r/tests/testdir_algos/glm/runit_GLM_control_variables.R

+    betas.cont <- model.cont@model$coefficients_table$coefficients
+    print(betas.cont)
+
+    expect_equal(res.dev, res.dev.cont)


Shouldn't there also be a res.dev for the case where control variables are excluded?

I would probably default the metrics to the case with control variables excluded so it's less confusing. Why would it be confusing otherwise? A user might explore the model and calculate the metrics themselves, and they will find a discrepancy because, IIRC, our predict should predict without control variables while the metrics are calculated with them.

So I think there should be something like res.dev and res.dev.with.control.variables. WDYT @maurever ?
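
To make the discrepancy concrete, here is a minimal sketch for the Gaussian case, where residual deviance is the sum of squared residuals. The class and method names are hypothetical and the numbers used with it are made up for illustration; it is not taken from the PR.

```java
/** Hypothetical illustration; not code from the PR. */
final class DevianceDiscrepancySketch {

    /** Gaussian residual deviance: the sum of squared residuals. */
    static double gaussianResidualDeviance(double[] actual, double[] predicted) {
        double deviance = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double residual = actual[i] - predicted[i];
            deviance += residual * residual;
        }
        return deviance;
    }
}
```

If a user feeds the output of predict (control effects excluded) into such a function, the result generally differs from a reported metric that was computed with control effects included, which is the confusion described above.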

tomasfryda · 2025-06-23T15:33:52Z

h2o-algos/src/main/java/hex/DataInfo.java

Are the changes in this file really necessary? It seems to me that instead of _adaptedFrameNames I can use _adaptedFrame.names(). It might be longer and maybe slower, but adding a variable that needs to be updated whenever another variable (the Frame) changes seems error-prone to me.
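
A toy sketch of the concern, using hypothetical stand-in classes rather than H2O's `DataInfo` and `Frame`: a cached copy of the names goes stale when the frame changes, while names derived on demand stay in sync.

```java
/** Hypothetical stand-ins; not H2O's DataInfo or Frame. */
final class CachedNamesSketch {

    /** Minimal frame-like holder of column names. */
    static final class FrameLike {
        final String[] names;
        FrameLike(String... names) { this.names = names; }
    }

    /** Keeps a separate copy of the names that must be updated by hand. */
    static final class InfoWithCache {
        final FrameLike frame;
        final String[] cachedNames;
        InfoWithCache(FrameLike frame) {
            this.frame = frame;
            this.cachedNames = frame.names.clone();
        }
    }

    /** Derives the names from the frame each time they are needed. */
    static final class InfoDerived {
        final FrameLike frame;
        InfoDerived(FrameLike frame) { this.frame = frame; }
        String[] names() { return frame.names; }
    }
}
```

The sketch only covers the case where the frame is still available; it says nothing about what to do once the frame reference has been cleared.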

tomasfryda · 2025-06-23T15:34:02Z

h2o-algos/src/main/java/hex/glm/GLMModel.java

@@ -1685,7 +1780,8 @@ public GLMOutput() {
    public GLMOutput(GLM glm) {
      super(glm);
      _dinfo = glm._dinfo.clone();
-      _dinfo._adaptedFrame = null;
+      _dinfo._adaptedFrame = null; 
+      _dinfo._adaptedFrameNames = glm._dinfo._adaptedFrameNames;
      String[] cnames = glm._dinfo.coefNames();
      String [] names = glm._dinfo._adaptedFrame._names;


names seems to be the same as _adaptedFrameNames. It would also be good to know how it differs from _coefficient_names. If you don't know, I'll try to find out when doing the final review.

There are already too many "names", so I would prefer not to add another one unless it's necessary.

Implement control variables, add junit, pyunit

3677093

maurever added feature glm labels Apr 9, 2025

maurever added this to the 3.48.0.1 milestone Apr 9, 2025

maurever self-assigned this Apr 9, 2025

maurever added 4 commits April 24, 2025 10:23

Fix scoring, add test cases.

0e43e96

Fix score0 to respect customer request, improve tests

954eec3

Implement scoring history and variable importance

98c06f5

Improve suffix, add simple data test

bf7c2fc

github-advanced-security bot found potential problems Jun 6, 2025


maurever changed the title ~~GH-16524 GLM - control variables~~ GH-16524 GLM - control variables - Gaussian, Bernoulli Jun 16, 2025

Add copilot suggestions

87b23bb

maurever requested review from valenad1 and tomasfryda June 16, 2025 09:40

maurever mentioned this pull request Jun 16, 2025

Controls and Offset variables in GLM #16524

Open

4 tasks

maurever changed the title ~~GH-16524 GLM - control variables - Gaussian, Bernoulli~~ GH-16524 GLM - control variables - Regression, Binomial Jun 18, 2025

tomasfryda reviewed Jun 23, 2025


@@ -635,3 +635,3 @@
                     Vec predictions = Vec.makeZero(data.numRows(), Vec.T_NUM);
-                    for (int i = 0; i < data.numRows(); i++) {
+                    for (long i = 0; i < data.numRows(); i++) {
                         double prediction = 0;
