Bugfix in PCA regression by mumichae · Pull Request #197 · theislab/scib · GitHub

Bugfix in PCA regression #197


Merged

merged 6 commits into master on Oct 27, 2020

Conversation

mumichae
Collaborator
  • fixed PCA Var instead of Var^2
  • proper order of data/response in regression
  • added tests
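For context, a minimal sketch of the corrected score described in these bullets, under the assumption that `X_pca` and `pca_var` come from a standard PCA (the helper name `pc_regression_score` is hypothetical; the formula mirrors the `R2Var = sum(r2 * Var) / 100` computation quoted later in this thread, weighting each PC's R² by its explained-variance fraction rather than its square):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def pc_regression_score(X_pca, pca_var, covariate):
    """Variance-weighted R^2 of a covariate regressed onto each PC.

    Hypothetical sketch of the fixed logic: the covariate is the data (X)
    and each PC is the response (y), and the weights use the explained
    variance fraction itself (the bugfix), not its square.
    """
    covariate = np.asarray(covariate, dtype=float).reshape(-1, 1)
    r2 = []
    for i in range(X_pca.shape[1]):
        pc = X_pca[:, [i]]                       # one PC as the response
        lm = LinearRegression().fit(covariate, pc)
        r2.append(lm.score(covariate, pc))       # R^2 for this PC
    Var = pca_var / np.sum(pca_var) * 100        # variance fractions in percent
    return float(np.sum(np.array(r2) * Var) / 100)
```

If a covariate explains one PC perfectly and nothing else, the score reduces to that PC's variance fraction, which is what makes the Var-vs-Var² distinction matter.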

@mumichae mumichae requested a review from LuckyMD October 22, 2020 14:36
@LuckyMD
Collaborator
LuckyMD commented Oct 22, 2020

@LisaSikkema could you verify that this produces the same output as your code?

@LuckyMD
Collaborator
LuckyMD commented Oct 22, 2020

@mumichae did you intend to commit all your test scripts already here? Or is this for the travis branch?

Collaborator
@LuckyMD LuckyMD left a comment


This looks addressed, but needs testing.

@mumichae
Collaborator Author

I intended the tests. They are basically running versions of what we had before. Pipeline tests are not yet working with the environments though.

@LisaSikkema
Contributor

Yes, it corresponds to my output for categorical covariates.
One thing I forgot to mention: if you're working with a continuous covariate, the way the code is written now, the continuous variable is still converted to dummies (so one dummy for every age, in my data). Not sure if that matters? I assume your batches are always categorical.
For the continuous data, if I skip the dummy step and reshape the covariate vector instead, the values also match up perfectly.

@LuckyMD
Collaborator
LuckyMD commented Oct 22, 2020

That matters... do you need to set the continuous covariate to be numerical somewhere?

@LisaSikkema
Contributor

It was numerical already. I suspect that's simply the way pd.get_dummies behaves when you feed it a single numerical vector.
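That matches pandas' documented behaviour: `get_dummies` one-hot encodes any Series passed to it directly, regardless of dtype, producing one column per unique value — so a numeric age vector becomes one dummy per distinct age. A quick illustration (synthetic ages, not the data from this thread):

```python
import numpy as np
import pandas as pd

age = pd.Series([25, 30, 25, 42, 30])   # numeric covariate
dummies = pd.get_dummies(age)           # still one-hot encoded: one column per unique age
print(dummies.shape)                    # (5, 3): three distinct ages

# The behaviour discussed here: a continuous covariate should instead stay
# a single numeric column for the regression.
as_numeric = np.asarray(age).reshape(-1, 1)
print(as_numeric.shape)                 # (5, 1)
```

(Only when `get_dummies` receives a whole DataFrame does it skip non-object columns; a bare Series is always encoded.)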

@LuckyMD
Collaborator
LuckyMD commented Oct 22, 2020

Pipeline tests are not yet working with the environments though.

Maybe remove those that are not yet working then... we shouldn't merge things that are not tested.

@mumichae
Collaborator Author
mumichae commented Oct 23, 2020

I have now added a case for continuous variables and tested against kBET. I get a small deviation (3e-5), but the score itself is close to 0. Unfortunately I don't have a good batch variable with a larger score, so I'm not sure whether that is just numerical noise.

Collaborator
@LuckyMD LuckyMD left a comment


please use if/else and not try/except for this.

scIB/metrics.py Outdated
try:
categorical = variable.dtype.name == 'category'
except:
categorical = not isinstance(variable[0], (int, float))
Collaborator


Why use error handling and not if else?

Collaborator Author


You mean check whether the variable is a Series/DataFrame with an if/else? I found this approach more readable and more general with respect to whether the object carries a dtype or not. Would you prefer if/else type checking instead?

Collaborator


yes. I don't think we should catch all errors like this when we know what we are looking for. Ideally check if categorical, check if numerical... if neither throw an error.
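A sketch of the explicit check being requested here (the helper name `is_categorical_covariate` is hypothetical; the dtype predicates are standard pandas API):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def is_categorical_covariate(variable: pd.Series) -> bool:
    """Explicit branching instead of a bare try/except:
    check categorical, check numeric, otherwise raise."""
    if isinstance(variable.dtype, pd.CategoricalDtype) or variable.dtype == object:
        return True
    if is_numeric_dtype(variable):
        return False
    raise TypeError(f"covariate dtype not supported: {variable.dtype}")
```

This keeps unexpected inputs loud (a `TypeError`) instead of silently swallowing every exception.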

Collaborator


Please do this as if else...

Collaborator Author


I included a different type check (with if/else), which will be in the next commit once the values of kBET are reproducible.

@mumichae
Collaborator Author

We're dealing with potentially different datatypes here. I don't think it would make that much of a difference whether we use an exception or an if, it would rather be a cosmetic change.

@LuckyMD
Collaborator
LuckyMD commented Oct 23, 2020

try/except catches all errors... that's a bit overkill... we should be specific and allow the code to throw errors if need be.

@LisaSikkema
Contributor

Yes, the pc_regression bit works for me and corresponds to the output of my own code.

@mumichae
Collaborator Author
mumichae commented Oct 26, 2020

When I tried to reproduce the values in kBET's pcRegression, I got completely different values for the pbmc3k dataset with batches from the scanpy tutorial.
scIB pytest function:

def adata_batch():

PC Regression with negative value:

def test_pc_regression(adata_batch):

The conversion of categorical values to dummy values might be the cause for this discrepancy.

variable = pd.get_dummies(variable)

@LisaSikkema What code are you comparing against?

@LisaSikkema
Contributor

I'm comparing this code of yours (I wrote the first 5 lines to make my object compatible with your code):

# imports added for completeness; the original comment omitted them
import numpy as np
import pandas as pd
import sklearn.linear_model

cov = "age" # also tried "dataset"
verbose = True
X_pca = subadata.obsm["X_pca"]
pca_var = subadata.uns["pca"]["variance"]
n_comps = 50
variable = subadata.obs[cov].copy()
try:
    categorical = variable.dtype.name == "category"
except AttributeError:
    categorical = not isinstance(variable[0], (int, float))

if categorical:
    if verbose:
        print("one-hot encode categorical values")
    variable = pd.get_dummies(variable).to_numpy()
else:
    variable = np.array(variable).reshape(-1, 1)

# fit linear model for n_comps PCs
r2 = []
for i in range(n_comps):
    pc = X_pca[:, [i]]
    lm = sklearn.linear_model.LinearRegression()
    lm.fit(variable, pc)
    r2_score = lm.score(variable, pc)
    r2.append(r2_score)

Var = pca_var / sum(pca_var) * 100
R2Var = sum(r2 * Var) / 100

against this code of mine:

# imports added for completeness; utils.check_if_nan is the author's own helper
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

var_explained = pd.DataFrame(index=range(n_pcs), columns=covariates + ["overall"])
for pc in range(n_pcs):
    y_true_unfiltered = subadata.obsm["X_pca"][:, pc]
    var_explained.loc[pc, "overall"] = np.var(y_true_unfiltered)
    for cov in covariates:
        x = subadata.obs[cov].values.copy()
        x_nans = np.vectorize(utils.check_if_nan)(x)
        x = x[~x_nans]
        y_true = y_true_unfiltered[~x_nans].reshape(-1, 1)
        if x.dtype in ["float32", "float", "float64"]:
            x = x.reshape(-1, 1)
        else:
            if len(set(x)) == 1:
                var_explained.loc[pc, cov] = np.nan
                continue
            x = pd.get_dummies(x)
        lrf = LinearRegression(fit_intercept=True).fit(
            x,
            y_true,
        )
        y_pred = lrf.predict(x)
        var_explained.loc[pc, cov] = np.var(y_pred)
total_variance_explained = np.sum(var_explained, axis=0).sort_values(ascending=False)
total_variance_explained_fractions = (
    total_variance_explained / total_variance_explained["overall"]
)

for variables that have no nan or "nan" entries.
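The two implementations above should agree up to normalisation: for ordinary least squares with an intercept, `lm.score` (R²) equals `np.var(y_pred) / np.var(y_true)`, so weighting R² by variance fractions (the scIB code) and summing predicted variances (the second code block) compute the same quantity per PC. A quick numerical check of that identity on synthetic data (not the scib code itself):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))                    # covariate
y = 2.0 * x[:, 0] + rng.normal(scale=0.5, size=200)  # stand-in for one PC

lm = LinearRegression().fit(x, y)
r2 = lm.score(x, y)                              # what the scIB code accumulates
var_ratio = np.var(lm.predict(x)) / np.var(y)    # what the second code block accumulates

# For OLS with an intercept the residuals are orthogonal to the fit,
# so these two numbers coincide exactly.
print(r2, var_ratio)
```

This suggests any remaining discrepancy between the two scripts comes from the dummy encoding or NaN filtering, not from the R² bookkeeping.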

@danielStrobl
Member

When I tried to reproduce the values in kBET pcRegression, I get completely different values for the pbmc3k dataset with batches scanpy tutorial

Hey! So I've just tried to reproduce that with the dataset generated by the adata_batch() function. I'm getting very similar scores for both your implementation and the kBET one (~8.99e-05). How did you test the kBET implementation?

This avoids negative values of the PC regression score
@LuckyMD LuckyMD merged commit ca41670 into master Oct 27, 2020
@LuckyMD LuckyMD deleted the fix_pcregression branch October 27, 2020 09:12
4 participants