Bugfix in PCA regression #197
Conversation
- fixed PC regression weighting: use Var instead of Var^2
- fixed order of data/response in the regression
- added tests
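The first two fixes can be sketched together. This is a hypothetical illustration, not the actual scIB implementation: the function name `pc_regression`, the `n_comps` parameter, and the 2-D covariate shape are all assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def pc_regression(X, covariate, n_comps=10):
    """Variance-weighted PC regression score (sketch).

    Weight each PC's R^2 by its explained variance (Var), not Var^2,
    and regress the PC scores (response) on the covariate (predictor).
    `covariate` is assumed to be of shape (n_samples, n_features).
    """
    pca = PCA(n_components=n_comps)
    pcs = pca.fit_transform(X)
    var = pca.explained_variance_  # Var, not Var**2

    r2 = []
    for i in range(n_comps):
        # proper order: covariate is the predictor, PC score the response
        lm = LinearRegression().fit(covariate, pcs[:, i])
        r2.append(lm.score(covariate, pcs[:, i]))

    return float(np.sum(var * np.array(r2)) / np.sum(var))
```

Since each R^2 is computed on the training data with an intercept, the weighted average stays within [0, 1].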
@LisaSikkema could you verify that this produces the same output as your code?
@mumichae did you intend to commit all your test scripts already here? Or is this for the Travis branch?
This looks addressed, but needs testing.
I intended to include the tests. They are basically runnable versions of what we had before. The pipeline tests are not yet working with the environments, though.
Yes, this corresponds to my output for categorical covariates.
That matters... do you need to set the continuous covariate to be numerical somewhere?
It was numerical already. I suspect that's simply the way pd.get_dummies behaves when you feed it a single numerical vector.
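That behaviour is easy to verify: when given a bare Series, `pd.get_dummies` one-hot encodes it regardless of dtype, so a continuous covariate is turned into one indicator column per unique value (the values below are made up for illustration).

```python
import pandas as pd

# A continuous covariate passed as a bare Series still gets one-hot
# encoded: each unique float becomes its own indicator column.
s = pd.Series([0.1, 0.5, 0.1, 0.9])
dummies = pd.get_dummies(s)
print(dummies.columns.tolist())  # [0.1, 0.5, 0.9]
```

This is why a continuous covariate needs a separate code path rather than going through the dummy encoding.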
Maybe remove those that are not yet working, then... we shouldn't merge things that are not tested.
I have now added a case for continuous variables and tested against kBET. I get a small deviation (3e-5), but the score itself is close to 0. Unfortunately I don't have a good batch variable with a larger score, so I'm not sure if that is just numerical noise.
Please use if/else and not try/except for this.
scIB/metrics.py (outdated)
```python
try:
    categorical = variable.dtype.name == 'category'
except:
    categorical = not isinstance(variable[0], (int, float))
```
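An explicit if/else version along the lines the reviewer asks for might look like this. It is a sketch, not the code that was eventually committed; the helper name `is_categorical` and the assumption that inputs may be a Series, array, or list are mine.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def is_categorical(variable):
    # Explicit dispatch instead of a blanket try/except; `variable`
    # may be a Series, array, or list (assumption about inputs).
    variable = pd.Series(variable)  # normalise plain arrays/lists
    if isinstance(variable.dtype, pd.CategoricalDtype) or variable.dtype == object:
        return True
    if is_numeric_dtype(variable):
        return False
    raise TypeError(f"unsupported dtype: {variable.dtype}")
```

Unlike the bare `except:`, this raises on genuinely unsupported inputs (e.g. datetimes) instead of silently classifying them.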
Why use error handling and not if/else?
You mean check whether the variable is of type series/dataframe with an if/else? I found this approach more readable and more general with respect to whether the input has a dtype or not. Would you prefer if/else type checking instead?
Yes. I don't think we should catch all errors like this when we know what we're looking for. Ideally: check if categorical, check if numerical... if neither, throw an error.
Please do this with if/else...
I included a different type check (with if/else), which will be in the next commit once the values of kBET are reproducible.
We're dealing with potentially different datatypes here. I don't think it would make much of a difference whether we use an exception or an if; it would rather be a cosmetic change.
try/except catches all errors... that's a bit overkill... we should be specific and allow the code to throw errors if need be.
Yes, the pc_regression bit works for me and corresponds to the output of my own code.
When I tried to reproduce the values of kBET's pcRegression, I get completely different values for the pbmc3k dataset with batches from the scanpy tutorial (line 26 in 706a115).
PC regression with a negative value: scib/tests/api/test_metrics.py, line 56 in 706a115.
The conversion of categorical values to dummy values might be the cause of this discrepancy (line 724 in 706a115).
@LisaSikkema What code are you comparing against?
I'm comparing this code of yours (I wrote the first 5 lines to make my object compatible with your code):
against this code of mine:
for variables that have no NaN or "nan" entries.
Hey! So I've just tried to reproduce that with the dataset generated by the
This avoids negative values of the PC regression score.
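One way to guard against negative values is simply to clip each R^2 at zero before the variance weighting. This is a sketch of the idea, not necessarily the exact change committed here; the helper name `clip_r2` is mine.

```python
import numpy as np

def clip_r2(r2):
    # R^2 can be negative when a model fits worse than predicting the
    # mean (e.g. under cross-validation); clipping at zero keeps the
    # variance-weighted PC regression score within [0, 1].
    return np.maximum(np.asarray(r2, dtype=float), 0.0)
```

With this guard, a pathological fit contributes nothing to the score instead of dragging it below zero.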