Arbitrary sized inputs for FCNs are slow. #5048
Comments
It's hard to come up with any suggestions here without seeing how you're using the APIs. Is it possible to write up a small snippet of code that demonstrates the problem?
Ah, I actually had a conversation with @vrv and have some more useful information for you. Short story: can you re-run after setting the environment variable TF_CUDNN_USE_AUTOTUNE=0? Long story: for some ops, such as convolutions, cuDNN offers several algorithms, and by default TensorFlow profiles the candidates the first time it sees a new input shape and caches the fastest one. With arbitrarily sized inputs almost every step introduces a new shape, so this auto-tuning work is repeated over and over. Disabling auto-tuning with the environment variable will mean that you might not end up using the best algorithm, but you should get consistent results. Hope that helps, let us know. Thanks!
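For anyone wanting to set this from inside the script rather than in the shell, here is a minimal sketch; only the variable name comes from this thread, the surrounding script structure is hypothetical:

import os

# Disable cuDNN auto-tuning; set it before TensorFlow initializes the GPU so
# it is already in place when the first convolution is executed.
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"

import tensorflow as tf  # imported only after the variable is set

# ... build and train the model as usual ...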
@asimshankar Hey, thanks! This was spot on, now it works very fast :) The issue bugged me for several days before I posted it here. I understand that my case can be quite rare, but perhaps to save other people trouble down the road it would be nice to make this feature somehow more explicit. Also, is it possible to manually choose those algorithms it tries to profile automatically?
Glad it worked out! We'll try to figure out a good place to make this more discoverable (suggestions welcome). The set of algorithms is hardcoded in cuda_dnn.cc right now, so to modify them you'll need to modify that source file. Since your issue has been resolved, closing this. Happy training.
@asimshankar Perhaps a config class like RunConfig could be a way to handle this type of detail? Additionally, it could default to a decaying or learned (turtles all the way down...) flag that enables/disables autotune, so no config is necessary for the typical use cases, including this one; the config class mentioned above could then be used in specific cases where other behavior is desired.
There is a StackOverflow question for this, for people who come across it in the future. @asimshankar to set the environment variable, do I literally run the following in bash?
export TF_CUDNN_USE_AUTOTUNE=0
python myscript.py
@ahundt We do have ConfigProto, where generally useful options are added, similar to a RunConfig class. Config options often start as an environment variable and get promoted to ConfigProto when enough people in the public use them for it to be better supported, documented, etc. Operationally, it's a lot easier to remove environment variables (they will just be ignored), and impossible to remove ConfigProto options (because then code will break), if, for example, one day we make auto-tuning fast enough that it doesn't matter. In any case, it's still not clear to us that we want to promote disabling auto-tuning into an option we can never take away, but hopefully this issue + the StackOverflow question will suffice for now -- thanks for doing that! And yes, that's how you set the env var. Let us know if that didn't work for some reason.
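As an illustration of the ConfigProto mechanism mentioned above, here is a small sketch; it uses an existing, unrelated option (GPU memory growth) purely to show the pattern, since there is no auto-tuning field in ConfigProto at the time of this thread:

import tensorflow as tf

# ConfigProto options travel with the Session rather than the environment.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # existing option, used here only as an example

with tf.Session(config=config) as sess:
    # ... run the graph ...
    pass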
@vrv thanks for the detailed explanation! Would the environment variable also work if the model was run via the C++ API?
Yeah, we read the environment variable in C++, so it doesn't matter how it's set, as long as it's set on the binary that's running cuDNN.
@vrv great, thanks again!
@asimshankar thanks
@asimshankar Let's say that the input image sizes of my model are ( |
There are 2 options:
In my case the tuned model performed ~30% better than without tuning...
Hi,
OS and version: TF r0.11.0rc0, Linux 64bit, cudnn-7.5-v5.1
I am using TF-Slim and an FCN-style architecture based on ResNets. I experience extremely slow training times: 5-10x slower compared to an equivalent Caffe implementation.
I train fully-convolutionally, and my images are of arbitrary sizes and aspect ratios. The training code uses FIFOQueue and preloads data in a separate thread. I use batch_size=1 as all images are of different sizes.
If I feed dummy random numpy tensors of a fixed size, it works very fast. I also tried generating input tensors of 10 predefined sizes and feeding them sequentially: the first 10 iterations were slow, but then training sped back up. It looks like some extra work is done for each new input size. I have only used Caffe before, and there it was somehow possible to resize all tensors per batch efficiently.
Am I missing some simple trick or is it a bug?
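A minimal sketch of the "predefined sizes" idea described above, assuming zero-padding images up to the nearest bucket is acceptable for the model; the bucket sizes and helper name are hypothetical:

import numpy as np

# Hypothetical bucket sizes: with a fixed set of input shapes, the cuDNN
# auto-tuner only profiles each shape once instead of on almost every image.
BUCKETS = [(256, 256), (384, 384), (512, 512)]

def pad_to_bucket(image):
    """Zero-pad an HxWxC image up to the smallest bucket that fits it."""
    h, w = image.shape[:2]
    for bucket_h, bucket_w in BUCKETS:
        if h <= bucket_h and w <= bucket_w:
            padding = ((0, bucket_h - h), (0, bucket_w - w), (0, 0))
            return np.pad(image, padding, mode="constant")
    raise ValueError("image %dx%d is larger than every bucket" % (h, w))

# Usage: feed pad_to_bucket(img) instead of img; padded regions can be masked
# out of the loss if the model is sensitive to them.
img = np.random.rand(300, 360, 3).astype(np.float32)
print(pad_to_bucket(img).shape)  # (384, 384, 3)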