Bugfix in PCA regression by mumichae · Pull Request #197 · theislab/scib · GitHub

Bugfix in PCA regression #197


Merged

merged 6 commits into master on Oct 27, 2020

Conversation

mumichae
Collaborator
  • fixed PCA Var instead of Var^2
  • proper order of data/response in regression
  • added tests
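For context, a minimal sketch of the corrected score described in these bullets, under the assumption that `X_pca` and `pca_var` come from a standard PCA (the helper name `pc_regression_score` is hypothetical; the formula mirrors the `R2Var = sum(r2 * Var) / 100` computation quoted later in this thread, weighting each PC's R² by its explained-variance fraction rather than its square):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def pc_regression_score(X_pca, pca_var, covariate):
    """Variance-weighted R^2 of a covariate regressed onto each PC.

    Hypothetical sketch of the fixed logic: the covariate is the data (X)
    and each PC is the response (y), and the weights use the explained
    variance fraction itself (the bugfix), not its square.
    """
    covariate = np.asarray(covariate, dtype=float).reshape(-1, 1)
    r2 = []
    for i in range(X_pca.shape[1]):
        pc = X_pca[:, [i]]                       # one PC as the response
        lm = LinearRegression().fit(covariate, pc)
        r2.append(lm.score(covariate, pc))       # R^2 for this PC
    Var = pca_var / np.sum(pca_var) * 100        # variance fractions in percent
    return float(np.sum(np.array(r2) * Var) / 100)
```

If a covariate explains one PC perfectly and nothing else, the score reduces to that PC's variance fraction, which is what makes the Var-vs-Var² distinction matter.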

@mumichae mumichae requested a review from LuckyMD October 22, 2020 14:36
@LuckyMD
Collaborator
LuckyMD commented Oct 22, 2020

@LisaSikkema could you verify that this produces the same output as your code?

@LuckyMD
Collaborator
LuckyMD commented Oct 22, 2020

@mumichae did you intend to commit all your test scripts already here? Or is this for the travis branch?

Collaborator
@LuckyMD LuckyMD left a comment


This looks addressed, but needs testing.

@mumichae
Collaborator Author

I intended the tests. They are basically running versions of what we had before. Pipeline tests are not yet working with the environments though.

@LisaSikkema
Contributor

Yes, it corresponds to my output for categorical covariates.
One thing I forgot to mention: if you're working with a continuous covariate, the way the code is written now, the continuous variable is still converted to dummies (so one dummy for every age, in my data). Not sure if that matters? I assume your batches are always categorical.
For the continuous data, if I skip the dummy step and reshape the covariate vector instead, the values also match up perfectly.

@LuckyMD
Collaborator
LuckyMD commented Oct 22, 2020

That matters... do you need to set the continuous covariate to be numerical somewhere?

@LisaSikkema
Contributor

It was numerical already. I suspect that's simply the way pd.get_dummies behaves when you feed it a single numerical vector.
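That matches pandas' documented behaviour: `get_dummies` one-hot encodes any Series passed to it directly, regardless of dtype, producing one column per unique value — so a numeric age vector becomes one dummy per distinct age. A quick illustration (synthetic ages, not the data from this thread):

```python
import numpy as np
import pandas as pd

age = pd.Series([25, 30, 25, 42, 30])   # numeric covariate
dummies = pd.get_dummies(age)           # still one-hot encoded: one column per unique age
print(dummies.shape)                    # (5, 3): three distinct ages

# The behaviour discussed here: a continuous covariate should instead stay
# a single numeric column for the regression.
as_numeric = np.asarray(age).reshape(-1, 1)
print(as_numeric.shape)                 # (5, 1)
```

(Only when `get_dummies` receives a whole DataFrame does it skip non-object columns; a bare Series is always encoded.)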

@LuckyMD
Collaborator
LuckyMD commented Oct 22, 2020

Pipeline tests are not yet working with the environments though.

Maybe remove those that are not yet working then... we shouldn't merge things that are not tested.

@mumichae
Collaborator Author
mumichae commented Oct 23, 2020

I have now added a case for continuous variables and tested against kBET. I get a small deviation (3e-5), but the score itself is close to 0. Unfortunately I don't have a good batch variable with a larger score, so I'm not sure whether that is just numerical noise.

Collaborator
@LuckyMD LuckyMD left a comment


please use if/else and not try/except for this.

scIB/metrics.py Outdated
try:
categorical = variable.dtype.name == 'category'
except:
categorical = not isinstance(variable[0], (int, float))
Collaborator


Why use error handling and not if else?

Collaborator Author


You mean check whether the variable is a Series/DataFrame with an if/else? I found this approach more readable and more general with respect to whether the object carries a dtype or not. Would you prefer if/else type checking instead?

Collaborator


yes. I don't think we should catch all errors like this when we know what we are looking for. Ideally check if categorical, check if numerical... if neither throw an error.
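A sketch of the explicit check being requested here (the helper name `is_categorical_covariate` is hypothetical; the dtype predicates are standard pandas API):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def is_categorical_covariate(variable: pd.Series) -> bool:
    """Explicit branching instead of a bare try/except:
    check categorical, check numeric, otherwise raise."""
    if isinstance(variable.dtype, pd.CategoricalDtype) or variable.dtype == object:
        return True
    if is_numeric_dtype(variable):
        return False
    raise TypeError(f"covariate dtype not supported: {variable.dtype}")
```

This keeps unexpected inputs loud (a `TypeError`) instead of silently swallowing every exception.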

Collaborator


Please do this as if else...

Collaborator Author


I included a different type check (with if/else), which will be in the next commit once the values of kBET are reproducible.

@mumichae
Collaborator Author

We're dealing with potentially different datatypes here. I don't think it would make that much of a difference whether we use an exception or an if, it would rather be a cosmetic change.

@LuckyMD
Collaborator
LuckyMD commented Oct 23, 2020

try/except catches all errors... that's a bit overkill... we should be specific and allow the code to throw errors if need be.

@LisaSikkema
Contributor

Yes, the pc_regression bit works for me and corresponds to the output of my own code.

@mumichae
Collaborator Author
mumichae commented Oct 26, 2020

When I tried to reproduce the values in kBET's pcRegression, I got completely different values for the pbmc3k dataset with batches from the scanpy tutorial.
scIB pytest function:

def adata_batch():

PC Regression with negative value:

def test_pc_regression(adata_batch):

The conversion of categorical values to dummy values might be the cause for this discrepancy.

variable = pd.get_dummies(variable)

@LisaSikkema What code are you comparing against?

@LisaSikkema
Contributor

I'm comparing this code of yours (I wrote the first 5 lines to make my object compatible with your code):

# imports added for completeness; the original comment omitted them
import numpy as np
import pandas as pd
import sklearn.linear_model

cov = "age" # also tried "dataset"
verbose = True
X_pca = subadata.obsm["X_pca"]
pca_var = subadata.uns["pca"]["variance"]
n_comps = 50
variable = subadata.obs[cov].copy()
try:
    categorical = variable.dtype.name == "category"
except AttributeError:
    categorical = not isinstance(variable[0], (int, float))

if categorical:
    if verbose:
        print("one-hot encode categorical values")
    variable = pd.get_dummies(variable).to_numpy()
else:
    variable = np.array(variable).reshape(-1, 1)

# fit linear model for n_comps PCs
r2 = []
for i in range(n_comps):
    pc = X_pca[:, [i]]
    lm = sklearn.linear_model.LinearRegression()
    lm.fit(variable, pc)
    r2_score = lm.score(variable, pc)
    r2.append(r2_score)

Var = pca_var / sum(pca_var) * 100
R2Var = sum(r2 * Var) / 100

against this code of mine:

# imports added for completeness; utils.check_if_nan is the author's own helper
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

var_explained = pd.DataFrame(index=range(n_pcs), columns=covariates + ["overall"])
for pc in range(n_pcs):
    y_true_unfiltered = subadata.obsm["X_pca"][:, pc]
    var_explained.loc[pc, "overall"] = np.var(y_true_unfiltered)
    for cov in covariates:
        x = subadata.obs[cov].values.copy()
        x_nans = np.vectorize(utils.check_if_nan)(x)
        x = x[~x_nans]
        y_true = y_true_unfiltered[~x_nans].reshape(-1, 1)
        if x.dtype in ["float32", "float", "float64"]:
            x = x.reshape(-1, 1)
        else:
            if len(set(x)) == 1:
                var_explained.loc[pc, cov] = np.nan
                continue
            x = pd.get_dummies(x)
        lrf = LinearRegression(fit_intercept=True).fit(
            x,
            y_true,
        )
        y_pred = lrf.predict(x)
        var_explained.loc[pc, cov] = np.var(y_pred)
total_variance_explained = np.sum(var_explained, axis=0).sort_values(ascending=False)
total_variance_explained_fractions = (
    total_variance_explained / total_variance_explained["overall"]
)

for variables that have no nan or "nan" entries.
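The two implementations above should agree up to normalisation: for ordinary least squares with an intercept, `lm.score` (R²) equals `np.var(y_pred) / np.var(y_true)`, so weighting R² by variance fractions (the scIB code) and summing predicted variances (the second code block) compute the same quantity per PC. A quick numerical check of that identity on synthetic data (not the scib code itself):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))                    # covariate
y = 2.0 * x[:, 0] + rng.normal(scale=0.5, size=200)  # stand-in for one PC

lm = LinearRegression().fit(x, y)
r2 = lm.score(x, y)                              # what the scIB code accumulates
var_ratio = np.var(lm.predict(x)) / np.var(y)    # what the second code block accumulates

# For OLS with an intercept the residuals are orthogonal to the fit,
# so these two numbers coincide exactly.
print(r2, var_ratio)
```

This suggests any remaining discrepancy between the two scripts comes from the dummy encoding or NaN filtering, not from the R² bookkeeping.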

@danielStrobl
Member

When I tried to reproduce the values in kBET pcRegression, I get completely different values for the pbmc3k dataset with batches scanpy tutorial

Hey! So I've just tried to reproduce that with the dataset generated by the adata_batch() function. I'm getting very similar scores for both your implementation and the kBET one (~8.99e-05). How did you test the kBET implementation?

This avoids negative values of the PC regression score
@LuckyMD LuckyMD merged commit ca41670 into master Oct 27, 2020
@LuckyMD LuckyMD deleted the fix_pcregression branch October 27, 2020 09:12
4 participants