Arbitrary sized inputs for FCNs are slow. · Issue #5048 · tensorflow/tensorflow · GitHub

Arbitrary sized inputs for FCNs are slow. #5048

Closed
eldar opened this issue Oct 18, 2016 · 13 comments

Comments

@eldar
eldar commented Oct 18, 2016

Hi,

OS and version: TF r0.11.0rc0, Linux 64bit, cudnn-7.5-v5.1

I am using TF-Slim and an FCN-style architecture based on ResNets. I experience extremely slow training times: 5-10x slower than an equivalent Caffe implementation.

I train fully-convolutionally, and my images are of arbitrary sizes and aspect ratios. The training code uses FIFOQueue and preloads data in a separate thread. I use batch_size=1 as all images are of different sizes.

If I feed dummy random numpy tensors of fixed size, it works very fast. I tried generating input tensors of 10 predefined sizes and feeding them sequentially: the first 10 iterations were slow, but then it sped back up. It looks like TF does some extra work for each new input size. I have only used Caffe before, and there it was somehow possible to resize all tensors per batch efficiently.

Am I missing some simple trick or is it a bug?

@asimshankar
Contributor

It's hard to come up with any suggestions here without seeing how you're using the APIs. Is it possible to write up a small snippet of code that demonstrates the problem?

@asimshankar asimshankar added the stat:awaiting response Status - Awaiting response from author label Oct 18, 2016
@asimshankar
Contributor

Ah, I actually had a conversation with @vrv and have some more useful information for you.

Short story: Can you re-run after setting the environment variable TF_CUDNN_USE_AUTOTUNE=0?

Long story: For some ops, such as Conv2D (on GPUs), TensorFlow supports "auto-tuning", which means that it profiles multiple algorithms for the operation before selecting one. This selection is cached based on the shapes of the tensors and some other parameters, but the first time new shapes are encountered it has to re-profile. This is likely what causes the slowdown when you provide arbitrary shapes, and why things speed up after the 10 predefined sizes you provided have been profiled.

Disabling auto-tuning with the environment variable will mean that you might not end up using the best algorithm, but should get consistent results.
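
For example, a minimal sketch of setting it from Python (the only requirement is that the variable is in the process environment before the first convolution runs, so setting it before importing TensorFlow is the safest):

import os

# Must be set before cuDNN runs its first convolution; before the import is safest.
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"

import tensorflow as tf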

Hope that helps, let us know. Thanks!

@eldar
Author
eldar commented Oct 19, 2016

@asimshankar Hey, thanks! This was spot on, now it works very fast :) The issue bugged me for several days before I posted it here. I understand that my case can be quite rare, but perhaps to save other people trouble down the road it would be nice to make this feature somehow more explicit. Also, is it possible to manually choose those algorithms it tries to profile automatically?

@asimshankar
Contributor

Glad it worked out! We'll try to figure out a good place to make this more discoverable (suggestions welcome). The set of algorithms is hardcoded in cuda_dnn.cc right now, so to modify them you'll need to modify that source file.

Since your issue has been resolved, closing this. Happy training.

@asimshankar asimshankar removed the stat:awaiting response Status - Awaiting response from author label Oct 19, 2016
@ahundt
Contributor
ahundt commented Dec 13, 2016

@asimshankar perhaps a config class like RunConfig could be a way to handle this type of detail?

Additionally, it could default to a decaying or learned (turtles all the way down...) enabling/disabling of autotune, so that no config is necessary for typical use cases, including this one. The config class mentioned above could then be used in specific cases where other behavior is desired.

@ahundt
Contributor
ahundt commented Dec 13, 2016

There is a Stack Overflow question about this, for people who come across this issue in the future.

@asimshankar to set the environment variable, do I literally in bash run:

export TF_CUDNN_USE_AUTOTUNE=0
python myscript.py

@vrv
vrv commented Dec 13, 2016

@ahundt We do have ConfigProto, where generally useful options are added, similar to a RunConfig class. Config options often start as an environment variable and get promoted to ConfigProto when enough people in the public use them for it to be better supported, documented, etc. Operationally, it's a lot easier to remove environment variables (they will just be ignored) than ConfigProto options (removing those would break code) if, for example, one day we make auto-tuning fast enough that it doesn't matter.
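
For context, a generic sketch of what a promoted option looks like in use (allow_growth is just an illustrative ConfigProto option, unrelated to auto-tuning):

import tensorflow as tf

# Options that live in ConfigProto are passed when the session is created.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # illustrative option only
sess = tf.Session(config=config)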

In any case, it's still not clear to us that we want to promote the disabling of auto-tuning into an option we can never take away, but hopefully this issue + the StackOverflow question will suffice for now -- thanks for doing that!

And yes, that's how you set the env var. Let us know if that didn't work for some reason.

@ahundt
Contributor
ahundt commented Dec 13, 2016

@vrv thanks for the detailed explanation! Would the environment variable also work if the model was run via the C++ API?

@vrv
vrv commented Dec 13, 2016

Yeah, we read the environment variable in C++, so it doesn't matter how it's set, as long as it's set on the binary that's running Cudnn.

@ahundt
Contributor
ahundt commented Dec 13, 2016

@vrv great, thanks again!

@zh794390558
Contributor
zh794390558 commented Aug 7, 2017

@asimshankar thanks

@kriskorrel-cw
kriskorrel-cw commented Jul 15, 2020

@asimshankar
Great! I have now found the culprit of the huge difference in training time between fixed and dynamic input image sizes.
I know I'm late to the party, but I was wondering whether you or anybody else knows how this auto-tuner works internally and what effect it would have in the long term.
Would the auto-tuning stop and settle on the optimal algorithms after some predefined number of tries, or could it theoretically keep re-profiling indefinitely, provided that the input image sizes are random (in a large enough domain)?

Let's say that the input image sizes of my model are (h, w, 3), where h and w are uniformly sampled between 500 and 1500, and I will train this model for multiple days anyway. Would it make sense to enable the auto-tuner, or should it basically always be disabled in such a setting?

@AndreyOrb

There are 2 options:

  1. Enable auto-tuning (default) - The auto-tuning algorithm is executed for EACH new tuple (batch_size, h, w, color) for EACH CNN layer. It runs the layer against all plans defined in cuDNN (~10 plans), whose processing times can vary widely, from about 200 nanoseconds to 200 milliseconds in my case. So "heating" the model with a single batch size can easily take several seconds. Once again, it depends on the model (how many layers, density, etc.). It then chooses the best plan (least execution time) for each layer.
     If you run again with the same dimension tuple (batch_size, h, w, color), the best plans have already been chosen and you get the best execution/processing time. If you restart the process, the config_tuple -> best_plan mapping is gone and the auto-tuning process starts from the beginning.

  2. Disable auto-tuning (TF_CUDNN_USE_AUTOTUNE=0) - You skip the tuning phase and the default plan is chosen, but it may not be optimal.

In my case the tuned model performed ~30% better than without tuning...
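
A hedged sketch of the "heating" idea above: if the set of input sizes is known (or can be bucketed), one forward pass per expected shape at startup moves the tuning cost out of the first real batches. The names here (sess, images, logits) are placeholders for your own session, input placeholder, and output tensor, and the size list is only an example:

import numpy as np

# Hypothetical list of the input resolutions the model is expected to see.
WARMUP_SIZES = [(512, 512), (768, 1024), (1080, 1920)]

def warm_up(sess, images, logits):
    # One forward pass per (batch, h, w, channels) tuple so cuDNN auto-tuning
    # profiles each shape up front rather than during the first real batch.
    for h, w in WARMUP_SIZES:
        dummy = np.zeros((1, h, w, 3), dtype=np.float32)
        sess.run(logits, feed_dict={images: dummy})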
