GH-16524 GLM - control variables - Regression, Binomial #16601
base: master
Conversation
private Frame scoreManualWithCoefficients(Double[] coefficients, Frame data, String frameName, int[] controlVariablesIdxs, boolean binomial){
    Vec predictions = Vec.makeZero(data.numRows(), Vec.T_NUM);
    for (int i = 0; i < data.numRows(); i++) {
Check failure (Code scanning / CodeQL): Comparison of narrow type with wide type in loop condition. Severity: High (test).
Copilot Autofix
To fix the issue, the type of the loop variable `i` should be changed from `int` to `long`. This ensures that the loop variable is at least as wide as the type of `data.numRows()`, preventing overflow and ensuring the loop condition is evaluated correctly. The change is localized to the loop declaration and does not affect the rest of the code's functionality.
Steps to implement the fix:
- Update the type of the loop variable `i` from `int` to `long` in the `for` loop on line 636.
- Ensure that all references to `i` within the loop remain compatible with the `long` type.
@@ -635,3 +635,3 @@
     Vec predictions = Vec.makeZero(data.numRows(), Vec.T_NUM);
-    for (int i = 0; i < data.numRows(); i++) {
+    for (long i = 0; i < data.numRows(); i++) {
     double prediction = 0;
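To make the CodeQL finding concrete, here is a minimal, self-contained Java sketch (not code from this PR; the class name and row count are invented for illustration). In the buggy pattern the `int` counter wraps to a negative value after `Integer.MAX_VALUE`, so the comparison against the `long` bound never becomes false and the loop never terminates; widening the counter to `long` avoids the wrap.

```java
public class LoopWidthDemo {
    public static void main(String[] args) {
        // Hypothetical row count larger than Integer.MAX_VALUE (2,147,483,647).
        long numRows = 3_000_000_000L;

        // Buggy pattern (do not run): `i` is promoted to long only for the comparison;
        // the increment still wraps the int to a negative value, so `i < numRows`
        // stays true forever once numRows exceeds Integer.MAX_VALUE.
        // for (int i = 0; i < numRows; i++) { ... }

        // Fixed pattern: the loop variable is as wide as the bound.
        long iterations = 0;
        for (long i = 0; i < numRows; i += 1_000_000_000L) { // large stride only to keep the demo fast
            iterations++;
        }
        System.out.println("iterations: " + iterations); // prints 3
    }
}
```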
betas.cont <- model.cont@model$coefficients_table$coefficients
print(betas.cont)

expect_equal(res.dev, res.dev.cont)
Shouldn't this also be `res.dev` when control variables are excluded?
I would probably default the metrics to the case where control variables are excluded, so it's less confusing. Why would it otherwise be confusing? A user might try to explore the model and calculate the metrics themselves, but what will happen when they try it? They will find a discrepancy, because IIRC our predict should predict without control variables while the metrics are calculated with them.
So I think there should be something like `res.dev` and `res.dev.with.control.variables`. WDYT @maurever ?
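For context on what two such metrics would measure, binomial residual deviance can be computed directly from a vector of predicted probabilities, so `res.dev` and a hypothetical `res.dev.with.control.variables` would differ only in which predictions they are fed. A minimal Java sketch of that calculation (this is not h2o's implementation; the class, method name, and example numbers are made up):

```java
public class BinomialDevianceSketch {
    // Residual deviance for a 0/1 response: -2 * sum( y*log(p) + (1-y)*log(1-p) ).
    static double residualDeviance(double[] y, double[] p) {
        double logLik = 0.0;
        for (int i = 0; i < y.length; i++) {
            double pi = Math.min(Math.max(p[i], 1e-15), 1.0 - 1e-15); // clamp to avoid log(0)
            logLik += y[i] * Math.log(pi) + (1.0 - y[i]) * Math.log(1.0 - pi);
        }
        return -2.0 * logLik;
    }

    public static void main(String[] args) {
        double[] y = {1, 0, 1, 1};
        double[] predsControlsExcluded = {0.80, 0.30, 0.60, 0.90}; // e.g. control effects removed
        double[] predsControlsIncluded = {0.90, 0.20, 0.70, 0.95}; // e.g. control effects included
        System.out.println("deviance, controls excluded: " + residualDeviance(y, predsControlsExcluded));
        System.out.println("deviance, controls included: " + residualDeviance(y, predsControlsIncluded));
    }
}
```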
Are the changes in this file really necessary? It seems to me that instead of `_adaptedFrameNames` I can use `_adaptedFrame.names()`. It might seem longer and maybe slower, but adding a variable that has to be updated whenever another variable (the Frame) changes seems error-prone to me.
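A sketch of the alternative suggested above, using stand-in types (this is not h2o-3 code; `SimpleFrame`, `DataInfoSketch`, and the method names are hypothetical). Deriving the names from the frame on demand keeps a single source of truth, whereas a cached array has to be updated by hand every time the frame changes:

```java
class SimpleFrame {
    private final String[] _names;
    SimpleFrame(String[] names) { _names = names; }
    String[] names() { return _names.clone(); }
}

public class DataInfoSketch {
    SimpleFrame _adaptedFrame;

    // Option A (cached copy): must be kept in sync whenever _adaptedFrame is replaced.
    String[] _adaptedFrameNames;

    // Option B (derived on demand): always reflects the current frame, nothing to keep in sync.
    String[] adaptedFrameNames() {
        return _adaptedFrame == null ? null : _adaptedFrame.names();
    }

    public static void main(String[] args) {
        DataInfoSketch info = new DataInfoSketch();
        info._adaptedFrame = new SimpleFrame(new String[]{"x1", "x2", "control1", "response"});
        System.out.println(String.join(", ", info.adaptedFrameNames()));
    }
}
```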
@@ -1685,7 +1780,8 @@ public GLMOutput() {
   public GLMOutput(GLM glm) {
     super(glm);
     _dinfo = glm._dinfo.clone();
     _dinfo._adaptedFrame = null;
+    _dinfo._adaptedFrameNames = glm._dinfo._adaptedFrameNames;
     String[] cnames = glm._dinfo.coefNames();
     String [] names = glm._dinfo._adaptedFrame._names;
`names` seems to be the same as `_adaptedFrameNames`. It would also be good to know how it differs from `_coefficient_names`. If you don't know, I'll try to find out when doing the final review.
There are already too many kinds of "names" here, so I would prefer not to add another one if it's not necessary.
Issue #16524
Add a parameter to specify control variables and to remove the effects of these variables for prediction and calculation of model metrics.
When fitting the GLM, the control variables are also fitted in the model, like regular predictors.
After the model is fitted, the control variables' effects are removed when predicting with the model and calculating metrics.
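As an illustration of that behavior, here is a minimal, self-contained Java sketch (not the PR's code; the class and method names, the coefficient layout with the intercept last, and the use of the inverse logit for the binomial case are assumptions made for the example). The model is fitted with all columns, and prediction simply skips the contribution of the columns flagged as control variables before the link is inverted:

```java
import java.util.HashSet;
import java.util.Set;

public class ControlVariableScoringSketch {
    // Inverse logit, used for the binomial family.
    static double sigmoid(double eta) { return 1.0 / (1.0 + Math.exp(-eta)); }

    /**
     * Scores rows with fitted coefficients while dropping the contribution of
     * the control-variable columns from the linear predictor.
     *
     * @param beta        fitted coefficients, one per column, intercept last
     * @param rows        numeric predictor values, one array per row
     * @param controlIdxs column indices treated as control variables
     * @param binomial    apply the inverse logit link if true
     */
    static double[] predictWithoutControls(double[] beta, double[][] rows,
                                           Set<Integer> controlIdxs, boolean binomial) {
        double[] preds = new double[rows.length];
        for (int r = 0; r < rows.length; r++) {
            double eta = beta[beta.length - 1];           // intercept
            for (int c = 0; c < rows[r].length; c++) {
                if (controlIdxs.contains(c)) continue;    // remove control-variable effects
                eta += beta[c] * rows[r][c];
            }
            preds[r] = binomial ? sigmoid(eta) : eta;
        }
        return preds;
    }

    public static void main(String[] args) {
        double[] beta = {0.8, -0.3, 1.2, 0.1};            // x0, x1, x2 (control), intercept
        double[][] rows = {{1.0, 2.0, 5.0}, {0.5, -1.0, 3.0}};
        Set<Integer> controls = new HashSet<>();
        controls.add(2);                                   // x2 is the control variable
        System.out.println(java.util.Arrays.toString(
                predictWithoutControls(beta, rows, controls, true)));
    }
}
```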
Requirements from customer:
1. During model training, which metrics should be used for optimization purposes (like early stopping or lambda search)?
In addition, we would prefer the same to be applied for the offset. In other words, during model training, for the metrics used for optimization purposes (like early stopping or lambda search), we would prefer the metrics to be calculated with both control variable effects and offset effects included.
2. For variable importance calculations, would you prefer:
a) To include control variables in importance rankings but mark them as "control"
b) To exclude control variables from importance rankings entirely
We would prefer to include control variables in importance rankings, but mark them as "control".
3. When displaying model metrics in output summaries, what would you prefer?
We would prefer the same to be applied for the offset. In other words, we would prefer two sets of metrics to be displayed: (1) with both control effects and offset included, and (2) with both control effects and offset excluded.
TODO:
- Implement scoring with/without control variables for the multinomial distribution (will be implemented in a separate PR)
Tests:
Other implementation (will be implemented in different PRs)