8000 GH-16524 GLM - control variables - Regression, Binomial by maurever · Pull Request #16601 · h2oai/h2o-3 · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

GH-16524 GLM - control variables - Regression, Binomial #16601

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

maurever
Copy link
Contributor
@maurever maurever commented Apr 9, 2025

Issue #16524

Add a parameter to specify control variables and to remove the effects of these variables for prediction and calculation of model metrics.

When fitting the GLM, control variables are also fitted in the model, like a regular predictor.
After the model is fitted, the control variables' effects are removed when predicting with the model and calculating metrics.

Requirements from customer:

1. During model training, which metrics should be used for optimization purposes (like early stopping or lambda search)?

In addition, we would prefer the same to be applied for the offset. In other words, during model training, for the metrics used for optimization purposes (like early stopping or lambda search), we would prefer the metrics to be calculated with both control variable effects and offset effects included.

2. For variable importance calculations, would you prefer:
a) To include control variables in importance rankings but mark them as "control"
b) To exclude control variables from importance rankings entirely

We would prefer to include control variables in importance rankings, but mark them as "control".

3. When displaying model metrics in output summaries, what would you prefer?

We would prefer the same to be applied for the offset. In other words, we would prefer two sets of metrics to be displayed: (1) with both control effects and offset included, and (2) with both control effects and offset excluded

TODO:

  • Add new parameter control variables
  • Implement scoring with/without control variables for regression distributions and the binomial distribution
  • Calculation scoring metrics with/without control variables (early stopping metrics only)
  • Edit scoring history table (add new metrics)
  • Variable importance - mark control variable (for example variable_control)
  • Implement scoring with/without control variables for the multinomial distribution (will be implemented in separated PR)

Tests:

  • The new parameter validation in Java
  • Test functionality with basic data in Java
  • Basic test control variables work in Python
  • Basic test control variables work in R
  • Scoring, prediction with/without control variables
  • Check scoring metrics with/without control variables
  • Generation Scoring history table
  • Variable importance

Other implementation (will be implemented in different PRs)

  • Grid search
  • Lambda search
  • Interactions

@maurever maurever added this to the 3.48.0.1 milestone Apr 9, 2025
@maurever maurever self-assigned this Apr 9, 2025

private Frame scoreManualWithCoefficients(Double[] coefficients, Frame data, String frameName, int[] controlVariablesIdxs, boolean binomial){
Vec predictions = Vec.makeZero(data.numRows(), Vec.T_NUM);
for (int i = 0; i < data.numRows(); i++) {

Check failure

Code scanning / CodeQL

Comparison of narrow type with wide type in loop condition High test

Comparison between
expression
of type int and
expression
of wider type long.

Copilot Autofix

AI 8 days ago

To fix the issue, the type of the loop variable i should be changed from int to long. This ensures that the loop variable is at least as wide as the type of data.numRows(), preventing overflow and ensuring the loop condition is evaluated correctly. The change is localized to the loop declaration and does not affect the rest of the code's functionality.

Steps to implement the fix:

  1. Update the type of the loop variable i from int to long in the for loop on line 636.
  2. Ensure that all references to i within the loop remain compatible with the long type.

Suggested changeset 1
h2o-algos/src/test/java/hex/glm/GLMControlVariablesTest.java

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/h2o-algos/src/test/java/hex/glm/GLMControlVariablesTest.java b/h2o-algos/src/test/java/hex/glm/GLMControlVariablesTest.java
--- a/h2o-algos/src/test/java/hex/glm/GLMControlVariablesTest.java
+++ b/h2o-algos/src/test/java/hex/glm/GLMControlVariablesTest.java
@@ -635,3 +635,3 @@
         Vec predictions = Vec.makeZero(data.numRows(), Vec.T_NUM);
-        for (int i = 0; i < data.numRows(); i++) {
+        for (long i = 0; i < data.numRows(); i++) {
             double prediction = 0;
EOF
@@ -635,3 +635,3 @@
Vec predictions = Vec.makeZero(data.numRows(), Vec.T_NUM);
for (int i = 0; i < data.numRows(); i++) {
for (long i = 0; i < data.numRows(); i++) {
double prediction = 0;
Copilot is powered by AI and may make mistakes. Always verify output.
@maurever maurever changed the title GH-16524 GLM - control variables GH-16524 GLM - control variables - Gaussian, Bernoulli Jun 16, 2025
@maurever maurever requested review from valenad1 and tomasfryda June 16, 2025 09:40
@maurever maurever changed the title GH-16524 GLM - control variables - Gaussian, Bernoulli GH-16524 GLM - control variables - Regression, Binomial Jun 18, 2025
betas.cont <- model.cont@model$coefficients_table$coefficients
print(betas.cont)

expect_equal(res.dev, res.dev.cont)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't here also be res.dev when control variables are excluded?

I would probably default the metrics to the control variables excluded case so it's less confusing. Why would it be otherwise confusing? A user might try to explore the model and calculate the metrics by themself but what will happen when they try it? They will find some discrepancy because IIRC our predict should predict without control variables but the metrics are calculated with them.

So I think there should be something like res.dev and res.dev.with.control.variables. WDYT @maurever ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are changes in this file really necessary? It seems to me that instead of _adaptedFrameNames I can use _adaptedFrame.names(). It might seem longer and maybe slower but adding one variable that needs to be changed when some other variable (Frame) changes seems to me to be error prone.

@@ -1685,7 +1780,8 @@ public GLMOutput() {
public GLMOutput(GLM glm) {
super(glm);
_dinfo = glm._dinfo.clone();
_dinfo._adaptedFrame = null;
_dinfo._adaptedFrame = null;
_dinfo._adaptedFrameNames = glm._dinfo._adaptedFrameNames;
String[] cnames = glm._dinfo.coefNames();
String [] names = glm._dinfo._adaptedFrame._names;
Copy link
Contributor
@tomasfryda tomasfryda Jun 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

names seem to be the same as _adaptedFrameNames. Also it would be good to know how does it differ from _coefficient_names. If you don't know, I'll try to find out when doing the final review.

There is already too much "names" so I would prefer to not add any other if it's not necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0