A question about microbatching and its impact on the compute_dp_sgd_privacy_statement function · Issue #620 · tensorflow/privacy

Open
mawolz opened this issue Apr 28, 2025 · 0 comments

Comments

mawolz commented Apr 28, 2025

Hello contributors of TensorFlow Privacy,

I am having a hard time understanding the changes to the privacy guarantee when using microbatches. I will provide an in-depth explanation of my current understanding, but there is a TLDR at the end of this post.

The function tf_privacy.compute_dp_sgd_privacy is deprecated since it assumes Poisson sampling and "does not account for doubling of sensitivity with microbatching."
The documentation of the new function, compute_dp_sgd_privacy_statement, points to "How to DP-fy ML: A Practical Guide to Machine Learning with Differential Privacy" (https://arxiv.org/abs/2303.00654, Sec. 5.6) for further explanation.

The basic modifications of microbatching are (I sketch my understanding in code right after this list):
- Clip the averaged per-microbatch gradient to clipping norm C.
- Sum the clipped per-microbatch gradients.
- Add noise proportional to the clipping norm and noise multiplier.
- Divide by the number of microbatches.
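To make sure I am describing the same procedure, here is a minimal NumPy sketch of my understanding (the function name and all variable names such as per_example_grads, clip_norm and noise_multiplier are my own placeholders, not TF Privacy API):

import numpy as np

def microbatched_noisy_gradient(per_example_grads, microbatch_size, clip_norm, noise_multiplier):
  # Split the batch of per-example gradients into microbatches of size k.
  num_microbatches = len(per_example_grads) // microbatch_size
  microbatches = np.array_split(
      per_example_grads[:num_microbatches * microbatch_size], num_microbatches)
  clipped = []
  for mb in microbatches:
    g = np.mean(mb, axis=0)  # average per-microbatch gradient
    g = g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))  # clip to norm C
    clipped.append(g)
  total = np.sum(clipped, axis=0)  # sum the clipped per-microbatch gradients
  noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=total.shape)  # noise ~ C * sigma
  return (total + noise) / num_microbatches  # divide by the number of microbatches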

However, the chapter also describes exactly how the privacy guarantee changes when microbatching is used. Most notably, it says, "...epsilon remains the same as in the no microbatching setting, but this approach adds more noise. The standard deviation used is 2k times larger, where k is the size of each microbatch."
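Spelling out where a factor of 2k could come from, as I currently read that sentence (this is purely my own reconstruction; sigma, C, B and k are the paper's symbols, the variable names are mine):

sigma, C, B, k = 1.1, 1.0, 64, 4  # noise multiplier, clip norm, batch size, microbatch size

# No microbatching: noise with std sigma*C is added to the sum of B clipped
# per-example gradients, then everything is divided by B.
std_without_microbatching = sigma * C / B

# Microbatching: sensitivity doubles to 2C, so noise with std sigma*2C is added
# to the sum of B/k clipped per-microbatch gradients, then divided by B/k.
std_with_microbatching = sigma * (2 * C) / (B // k)

print(std_with_microbatching / std_without_microbatching)  # 2 * k = 8.0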

In the code (https://github.com/tensorflow/privacy/blob/master/tensorflow_privacy/privacy/analysis/compute_dp_sgd_privacy_lib.py) microbatching mainly does the following:
if used_microbatching:
  noise_multiplier /= 2

Initially, this is quite confusing given the explanation by Ponomareva et al. quoted above. I thought there was some kind of mistake and that the line should instead read something like:
if used_microbatching:
  noise_multiplier *= 2 * size_microbatches

The comments in the aforementioned Python file mention that the noise multiplier is halved to account for the doubling of sensitivity described by Ponomareva et al. This would mean that the /2 is there to counteract the *2 introduced by microbatching. I don't understand why this is necessary. Is this done to uphold some equality?
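If the equality being upheld is about the effective noise multiplier that the accountant sees (noise standard deviation divided by sensitivity), then my guess at the intended reasoning is the following (again my own naming, not the library's):

clip_norm = 1.0         # C
noise_multiplier = 1.1  # sigma configured for DP-SGD

noise_std = noise_multiplier * clip_norm  # noise that is actually added
sensitivity = 2 * clip_norm               # doubled by microbatching, per the code comments
effective_noise_multiplier = noise_std / sensitivity

print(effective_noise_multiplier == noise_multiplier / 2)  # True

Is that the intended reading?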

Whatever the case might be, the paper also states that microbatching adds k times more noise. I don't think this factor is accounted for anywhere in the epsilon calculation. Why is it not addressed?

There is probably something that I am overlooking or misunderstanding, but I can't pinpoint it. Is it correct that the noise multiplier simply gets cut in half, or is this an oversight?

TLDR: The code for the epsilon calculation in TensorFlow Privacy, compute_dp_sgd_privacy_statement, halves the noise multiplier when microbatching is enabled. This correction is applied both with and without Poisson sampling. According to the relevant paper ("How to DP-fy ML: A Practical Guide to Machine Learning with Differential Privacy", https://arxiv.org/abs/2303.00654, Sec. 5.6), microbatching adds more noise to the Gaussian mechanism of DP-SGD. Why is the noise multiplier halved? Do the number of microbatches and their size really not matter?
