TensorFlow programs using GPU freeze instance #1947
Comments
Investigated some more with the first MNIST tutorial and it seems to freeze at sess.run(init).
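For context, that hang point corresponds to the variable-initialization step of the tutorial. A minimal sketch of that step in the TF 0.x API used at the time (the variable shapes here are illustrative, not the tutorial's actual ones):

```python
import tensorflow as tf

# Stand-ins for the tutorial's weights and biases.
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

init = tf.initialize_all_variables()  # TF 0.x name for the init op

sess = tf.Session()
sess.run(init)  # the reported freeze happens on this call with the GPU build
print("initialization finished")
```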
I still have the same problem, now on a different server (SoftLayer with a Tesla K80). This is what I did for the install:
This freezes the server with these lines as the last ones. I also compiled the CUDA samples and they all run without a problem.
I really would love some help. It still does not work and freezes my computer every time I create a session. I tried installing from source as well as through pip.
For example, when running this tiny program:
this works with the CPU TensorFlow installed through pip but does not work with the GPU version:
The program freezes before exiting and I have to hard-reboot the instance. I again checked the versions: driver 352.39, CUDA 7.5.18, and cuDNN 4.0.7; the CUDA samples all still work.
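The tiny program itself did not survive in this thread; as a stand-in, a minimal script of the kind described (fine on the CPU-only pip build, hangs before exit with the GPU build) might look like this in the TF 0.x API:

```python
import tensorflow as tf

# Trivial graph: add two constants.
a = tf.constant(2.0)
b = tf.constant(3.0)
c = a + b

# log_device_placement prints whether the op actually landed on the GPU.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c))
# On the affected machines the process reportedly hangs around session
# teardown here, and the instance has to be hard-rebooted.
```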
Sorry - I missed this issue. I don't have a good idea of why this might be happening, so I'm reassigning to @zheng-xq as our best GPU expert.
I can reproduce the same issue with the convolution.py MNIST example running on a g2.2xlarge AWS GPU machine after installing a machine-compiled TensorFlow wheel. The instance freezes after Initialization is printed, right after the tf.Session() is created and all the variables are initialized.
The first step is to find out where the hang is happening. Could someone help print out the call stacks of all threads while the hang is happening, at least the unique ones?
I am also getting the same problem. Has anyone already solved it? Any suggestion on how to overcome it? ubuntu@ip-172-31-19-3:~/TensorFlowWorkSpace/rnn/wordSequence$ python sample.py -n=200 --prime='Hello ' After this, nothing happens.
@dipanjan06, since we cannot reproduce this problem locally, we need your help to debug this. While the process is hanging, could you use gdb to attach to it and print out the call stacks?
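As a supplement to attaching gdb, here is a minimal sketch (not part of the original exchange) of getting Python-level stack dumps out of a hung process. Note it only shows Python frames; the native CUDA frames asked about above still require gdb (attach with gdb -p <pid>, then thread apply all bt):

```python
import faulthandler
import signal
import sys

# Dump the stacks of all Python threads whenever the process receives SIGUSR1,
# so a dump can be triggered from another shell with `kill -USR1 <pid>`.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Also dump the stacks automatically every 60 seconds while the script runs;
# this still fires even if the main thread is stuck inside a native call.
faulthandler.dump_traceback_later(timeout=60, repeat=True, file=sys.stderr)

# ... build the graph and create the session below as usual ...
```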
This is a stack trace from my AWS instance, which is configured according to these instructions: http://eatcodeplay.com/installing-gpu-enabled-tensorflow-with-python-3-4-in-ec2/ but with the following changes: bazel 0.3.0 (instead of 0.1.8) and tensorflow 0.9.0 (instead of a specific commit). A problem similar to the above occurs when running cifar10_multi_gpu_train.py.
gdb output: http://pastebin.com/xd7Z2eWZ
What I notice is high CPU on ksoftirqd, and checking /proc/interrupts there is a very high number of xen-pirq-msi on the nvidia device. /proc/interrupts: http://pastebin.com/m47JVVtR
At some point after running, the OS becomes unresponsive to other inputs. Although I have found it very hard to get a reproducible TF GPU setup on AWS, I did get this configuration running a couple of weeks ago. After some time working on this, I am pretty sure I have reproduced my initial configuration, and all I see now is freezes. I almost wonder if something has changed under the hood in AWS?
Hello, as requested please find the gdb stack trace. Regards,
@ChrisHowlin, one thing I found in your log is "cuMemAlloc_v2"; I wonder whether it consistently shows up in the hang? Could you repeat the process and take multiple snapshots of the hanging call stacks? @dipanjan06, thank you for the stack trace. However, it is not complete. The last line is "---Type <return> to continue, or q <return> to quit---". Please press <return> to get the full stack printout. Also it is worth repeating the process and taking multiple snapshots of the hanging call stacks, so we can find what is common among them, which is likely to be the culprit.
Hello, as advised I have taken multiple snapshots of the call stack while the program hangs. The last log lines are:
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:900] successful
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with name: GRID K520 major: 3 minor: 0 memoryClockRate (GHz) 0.797 pciBusID 0000:00:03.0 Total memory: 4.00GiB Free memory: 3.95GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating
StackTrace:
@ChrisHowlin, @dipanjan06, thanks for the stack traces, they are very useful. I suspect this is a bug in the CUDA driver.
Details: I found that in both stack traces, two CUDA calls are still active while all other threads are effectively sleeping. The first is a new thread starting up, and the second is a thread allocating memory; in one case it is CUDA device memory and in the other it is CUDA host memory. It seems that those two calls are deadlocking with each other, so the problem might be in the CUDA driver. Please try the latest driver. Also I'll add the NVIDIA team to this discussion.
Thread 14 (Thread 0x7ff6ea7fc700 (LWP 6730)):
Thread 1 (Thread 0x7ff764d4a740 (LWP 6687)):
Adding @benbarsdell from NVIDIA to this discussion to look at this GRID CUDA driver issue.
@benbarsdell @zheng-xq friendly ping?
I'm experiencing the same problem as well. Ubuntu 14.04 on AWS g2.2xlarge, no reaction to
Edit: |
Solved this problem by installing a new driver from the Nvidia site. Just download the latest driver for your card. The solution is from here: https://groups.google.com/d/msg/torch7/kLusyLEj4oc/MLRvcCy_FAAJ
@mbektimirov Your solution works for me. I upgraded my Nvidia driver from version 352.39 to 367.44. Now I can run TF freely without GPU freezing. Thank you!
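For anyone verifying that the upgrade took effect, a small sketch (an illustration, not from the thread) that asks the driver for its version from Python, assuming nvidia-smi is on the PATH:

```python
import subprocess

# Query the driver version and GPU name; these are standard nvidia-smi options.
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version,name", "--format=csv,noheader"]
).decode().strip()
print(out)  # e.g. "367.44, GRID K520"
```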
@benbarsdell @zheng-xq I am running into a similar problem. My configuration is:
I have previously tried with:
I have built tensorflow from source. Here are the scripts I used to configure and [install](https://github.com/dojoteef/dotfil
Here are a few pastebins of gdb output from when the threads were stalled. I also received a core dump.
I am not sure if the issues are related or not, but it seems likely that they are. If there is anything you need me to do in order to help figure out/address the issues I am happy to. Not being able to use my GPUs is unfortunately really slowing down my ability to make progress. Thanks!
Just a follow-up on this. I finally tracked this down to a faulty motherboard. Sorry for the false alarm. In case anyone else runs into an error case similar to mine and you happen to be using an ASRock X99 WS-E motherboard, it seems to be a common quality-control problem they have.
Looks like on AWS a driver update resolves the problem.
@dojoteef I have the exact same problem - random freezes or
These errors happen randomly and the expected time to failure is about 2 hours, so training larger models is a nightmare. I have the exact same setup as you, with 2 Titan XPs and the same driver version, but I own an ASUS X99-E WS motherboard. How did you resolve your motherboard issues? Do newer Nvidia drivers or a newer BIOS version solve the problem?
@PiotrDabkowski I believe the issue could be the same despite a different motherboard manufacturer. This review on Newegg seems to indicate the PLX chips used for the PCIe bus are to blame. I believe your motherboard also uses those chips. I fixed my issue by replacing my motherboard.
@dojoteef Thank you for your answer. Error 32 suggests a bad PLX chip. To confirm it I have been running the gpu_burn test for half an hour now without errors. Will leave it running for 12 hours to make sure :) EDIT: Crashed after 2 hours of gpu_burn. Not a TensorFlow issue, just a bad motherboard. Shame on you, ASUS.
I had the same problem before. It turns out that I had installed CUDA 7.0 and CUDA 8.0 with Nvidia driver 367.57 on Ubuntu 14.04. The issue happens when I forget to set LD_LIBRARY_PATH to CUDA 8.0.
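For anyone with multiple CUDA toolkits installed side by side, here is a small sketch (an illustration, not from the thread) for checking which CUDA runtime a TensorFlow process actually loaded on Linux:

```python
import os

print("LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH", "<not set>"))

import tensorflow as tf  # importing the GPU build loads the CUDA libraries

# /proc/self/maps lists every shared library mapped into this process,
# so it shows which libcudart / libcudnn files were actually picked up.
with open("/proc/self/maps") as maps:
    libs = {line.split()[-1] for line in maps
            if "libcudart" in line or "libcudnn" in line}
for path in sorted(libs):
    print(path)
```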
Related to this is an issue I posted about on Stack Overflow, which turned out to be a hardware issue. The program trained for a while, then hung, then crashed, and it appears to be a PCIe issue related to message signaled interrupts (more info here and here). Setting the kernel parameter pci=nommconf fixed the issue for me.
Experienced the same here. This seems to only happen the first time running TensorFlow. The program responds to SIGTERM about 10 minutes later, and the following trials work just fine.
Hi, everyone. It might be very late to join this discussion. |
I have encountered the same problem here. |
Environment info
Operating System: Ubuntu 14.04 on AWS g2.2xlarge
Installed version of CUDA and cuDNN: CUDA 7.5 (7.5.18), cuDNN 4.0.7
Output of ls -l /path/to/cuda/lib/libcud*:
lib/libcudart.so lib/libcudart.so.7.5 lib/libcudart.so.7.5.18 lib/libcudart_static.a
If installed from sources, provide the commit hash:
commit bc5e961 (so TensorFlow 0.8.0), but also tried 0.7.1 (commit 028d0b4)
Steps to reproduce
All larger TensorFlow scripts freeze. I can run simple examples like doing a matmul on the GPU, but all larger programs, either my own or from the source tree (for example tensorflow/tensorflow/models/image/cifar10_train.py), freeze after a short time (no more output, and not able to Ctrl-C or Ctrl-Z). Also the time of the freeze seems to vary; I once made it through 2 epochs of training my own NN before it froze.
Example output:
python cifar10_train.py
and then nothing happens (I did wait a lot longer than a few minutes before Ctrl-C as well)
But this script here works and executes on the GPU:
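The working script was not included in the extract; a minimal GPU matmul of the kind described, in the TF 0.x API of the time, might look like this:

```python
import tensorflow as tf

# Pin a small matrix multiply to the first GPU.
with tf.device('/gpu:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 0.0], [0.0, 1.0]])
    c = tf.matmul(a, b)

# allow_soft_placement falls back to CPU if an op has no GPU kernel;
# log_device_placement prints where each op actually ran.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(c))
```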