Arbitrary sized inputs for FCNs are slow. #5048
Comments
It's hard to come up with any suggestions here without seeing how you're using the APIs. Is it possible to write up a small snippet of code that demonstrates the problem?
Ah, I actually had a conversation with @vrv and have some more useful information for you. Short story: can you re-run after setting the environment variable TF_CUDNN_USE_AUTOTUNE=0? Long story: for some ops, such as convolutions, cuDNN offers several algorithms, and by default TensorFlow profiles the candidates the first time it sees a new input shape and caches the fastest one. With arbitrarily sized inputs almost every step introduces a new shape, so this auto-tuning work is repeated over and over. Disabling auto-tuning with the environment variable will mean that you might not end up using the best algorithm, but you should get consistent results. Hope that helps, let us know. Thanks!
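For anyone wanting to set this from inside the script rather than in the shell, here is a minimal sketch; only the variable name comes from this thread, the surrounding script structure is hypothetical:

import os

# Disable cuDNN auto-tuning; set it before TensorFlow initializes the GPU so
# it is already in place when the first convolution is executed.
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"

import tensorflow as tf  # imported only after the variable is set

# ... build and train the model as usual ...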
@asimshankar Hey, thanks! This was spot on, now it works very fast :) The issue bugged me for several days before I posted it here. I understand that my case can be quite rare, but perhaps to save other people trouble down the road it would be nice to make this feature somehow more explicit. Also, is it possible to manually choose those algorithms it tries to profile automatically?
Glad it worked out! We'll try to figure out a good place to make this more discoverable (suggestions welcome). The set of algorithms is hardcoded in cuda_dnn.cc right now, so to modify them you'll need to modify that source file. Since your issue has been resolved, closing this. Happy training.
@asimshankar Perhaps a config class like RunConfig could be a way to handle this type of detail? Additionally, it could default to a decaying or learned (turtles all the way down...) flag that enables/disables autotune, so no config is necessary for the typical use cases, including this one; the config class mentioned above could then be used in specific cases where other behavior is desired.
There is a StackOverflow question for this, for people who come across it in the future. @asimshankar to set the environment variable, do I literally run the following in bash?
export TF_CUDNN_USE_AUTOTUNE=0
python myscript.py
@ahundt We do have ConfigProto, where generally useful options are added, similar to a RunConfig class. Config options often start as an environment variable and get promoted to ConfigProto when enough people in the public use them for it to be better supported, documented, etc. Operationally, it's a lot easier to remove environment variables (they will just be ignored), and impossible to remove ConfigProto options (because then code will break), if, for example, one day we make auto-tuning fast enough that it doesn't matter. In any case, it's still not clear to us that we want to promote disabling auto-tuning into an option we can never take away, but hopefully this issue + the StackOverflow question will suffice for now -- thanks for doing that! And yes, that's how you set the env var. Let us know if that didn't work for some reason.
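As an illustration of the ConfigProto mechanism mentioned above, here is a small sketch; it uses an existing, unrelated option (GPU memory growth) purely to show the pattern, since there is no auto-tuning field in ConfigProto at the time of this thread:

import tensorflow as tf

# ConfigProto options travel with the Session rather than the environment.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # existing option, used here only as an example

with tf.Session(config=config) as sess:
    # ... run the graph ...
    pass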
@vrv thanks for the detailed explanation! Would the environment variable also work if the model was run via the C++ API?
Yeah, we read the environment variable in C++, so it doesn't matter how it's set, as long as it's set on the binary that's running cuDNN.
@vrv great, thanks again!
@asimshankar thanks
@asimshankar Let's say that the input image sizes of my model are ( |
There are 2 options:
In my case the tuned model performed ~30% better than without tuning...
Hi,
OS and version: TF r0.11.0rc0, Linux 64bit, cudnn-7.5-v5.1
I am using TF-Slim and an FCN-style architecture based on ResNets. I experience extremely slow training times: 5-10x slower compared to an equivalent Caffe implementation.
I train fully-convolutionally, and my images are of arbitrary sizes and aspect ratios. The training code uses FIFOQueue and preloads data in a separate thread. I use batch_size=1 as all images are of different sizes.
If I feed dummy random numpy tensors of a fixed size, it works very fast. I also tried generating input tensors of 10 predefined sizes and feeding them sequentially: the first 10 iterations were slow, but then training sped back up. It looks like some extra work is done for each new input size. I have only used Caffe before, and there it was somehow possible to resize all tensors per batch efficiently.
Am I missing some simple trick or is it a bug?
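A minimal sketch of the "predefined sizes" idea described above, assuming zero-padding images up to the nearest bucket is acceptable for the model; the bucket sizes and helper name are hypothetical:

import numpy as np

# Hypothetical bucket sizes: with a fixed set of input shapes, the cuDNN
# auto-tuner only profiles each shape once instead of on almost every image.
BUCKETS = [(256, 256), (384, 384), (512, 512)]

def pad_to_bucket(image):
    """Zero-pad an HxWxC image up to the smallest bucket that fits it."""
    h, w = image.shape[:2]
    for bucket_h, bucket_w in BUCKETS:
        if h <= bucket_h and w <= bucket_w:
            padding = ((0, bucket_h - h), (0, bucket_w - w), (0, 0))
            return np.pad(image, padding, mode="constant")
    raise ValueError("image %dx%d is larger than every bucket" % (h, w))

# Usage: feed pad_to_bucket(img) instead of img; padded regions can be masked
# out of the loss if the model is sensitive to them.
img = np.random.rand(300, 360, 3).astype(np.float32)
print(pad_to_bucket(img).shape)  # (384, 384, 3)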