tensor flow programs using gpu freeze instance · Issue #1947 · tensorflow/tensorflow · GitHub

tensor flow programs using gpu freeze instance #1947


Closed
flow-ryan opened this issue Apr 14, 2016 · 30 comments

Comments

@flow-ryan

Environment info

Operating System: Ubuntu 14.04 on AWS g2.2xlarge

Installed version of CUDA and cuDNN: cuda-7.5 cudnn-4.0.7
(please attach the output of ls -l /path/to/cuda/lib/libcud*):
lib/libcudart.so lib/libcudart.so.7.5 lib/libcudart.so.7.5.18 lib/libcudart_static.a

If installed from sources, provide the commit hash:
commit bc5e961 (tensorflow 0.8.0); also tried 0.7.1 (commit 028d0b4)

Steps to reproduce

All larger TensorFlow scripts freeze. I can run simple examples like a matmul on the GPU, but all larger programs, either my own or from the repository (for example tensorflow/tensorflow/models/image/cifar10_train.py), freeze after a short time (no more output, and the process does not respond to Ctrl-C or Ctrl-Z). The time of the freeze also seems to vary; I once made it through 2 epochs of training my own NN before it froze.

example output:

python cifar10_train.py

I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
Downloading cifar-10-binary.tar.gz 100.0%
Successfully downloaded cifar-10-binary.tar.gz 170052171 bytes.
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
^C
^Z
^C

and nothing happens (I waited a lot longer than a few minutes before pressing Ctrl-C as well)

but this script works and executes on the GPU:

import tensorflow as tf
a = tf.constant([[3.,3.]])
b = tf.constant([[2.],[2.]])
c = tf.matmul(a,b)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print sess.run(c)

@flow-ryan
Author

Investigated some more with the first MNIST tutorial: it seems to freeze at sess.run(init).

@keveman keveman assigned mrry and unassigned keveman Apr 14, 2016
@flow-ryan
Author

I still have the same problem, now on a different server (SoftLayer with a Tesla K80).

This is what I did to install:

sudo apt-get update -y
sudo apt-get upgrade -y
sudo apt-get install -y build-essential
sudo apt-get install -y zip zlib1g-dev


# for some perl warning
locale-gen en_US en_US.UTF-8
export LANGUAGE=en_US.UTF-8
export LC_ALL=en_US.UTF-8
sudo dpkg-reconfigure locales

# add the following to /etc/default/locale
# LC_ALL="en_US.UTF-8"
# LANG="en_US.UTF-8"
# LANGUAGE="en_US:en"
# LC_CTYPE="en_US.UTF-8"

sudo apt-get install -y make pkg-config xorg-dev

# Blacklist nouveau, which conflicts with the nvidia driver
echo -e "blacklist nouveau\nblacklist lbm-nouveau\noptions nouveau modeset=0\nalias nouveau off\nalias lbm-nouveau off\n" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
sudo update-initramfs -u
sudo reboot # Reboot

# install nvidia driver, cuda, cudnn
wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda_7.5.18_linux.run

sudo apt-get install -y linux-image-extra-`uname -r` linux-headers-`uname -r` linux-image-`uname -r`

chmod +x cuda_7.5.18_linux.run
./cuda_7.5.18_linux.run -extract=`pwd`/nvidia_installers
cd nvidia_installers
#install driver
sudo ./NVIDIA-Linux-x86_64-352.39.run 

#install cuda
sudo ./cuda-linux64-rel-7.5.18-19867135.run 

echo 'export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"' >> ~/.bashrc
echo 'export CUDA_HOME=/usr/local/cuda' >> ~/.bashrc
echo 'export PATH="$PATH:/usr/local/cuda-7.5/bin"' >> ~/.bashrc

#install cudnn --> first download cudnn-7.5-linux-x64-v5.0-rc.tar or 7.0-v4
tar -xf cudnn-7.0-linux-x64-v4.0-prod.tar
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo cp cuda/include/cudnn.h /usr/local/cuda/include/

# install all package dependencies
sudo apt-get install -y \
zip \
swig \
gfortran \
git \
libboost-all-dev \
libatlas-base-dev \
libblas-dev \
liblapack-dev \
python-dev \
python-pip \
vim \
software-properties-common

sudo pip install -U \
virtualenv \
numpy \
scipy \
matplotlib \
gensim \
sacred \
scikit-learn \
langdetect \
pymongo \
jupyter

# install jdk 1.8 for bazel
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer

cd

# install bazel
git clone https://github.com/bazelbuild/bazel.git
cd bazel
git checkout tags/0.2.0
./compile.sh
sudo cp output/bazel /usr/bin
cd ..

# install tensorflow
git clone --recurse-submodules https://github.com/tensorflow/tensorflow
cd tensorflow
git checkout tags/v0.8.0
./configure

bazel build -c opt --config=cuda //tensorflow/cc:tutorials_example_trainer
bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu

This freezes the server, with these lines as the last output:
000002/000001 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]
000002/000001 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]
000002/000001 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]

I also compiled the cuda samples and they all run without a problem
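
A rough sketch of that sanity check, in case it helps others; the sample path is an assumption based on the default runfile install location, so adjust it to wherever the samples were installed:

# build and run one of the bundled CUDA samples as a GPU sanity check
cd /usr/local/cuda-7.5/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery   # should list the GPU and end with "Result = PASS"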

@flow-ryan
Author

I would really love some help; it still does not work and freezes my machine every time I create a session.

I tried installing from source as well as through pip

@flow-ryan
Author
flow-ryan commented May 31, 2016

For example, when running this tiny program:

import tensorflow as tf
a = tf.ones([100])
sess = tf.Session()
print sess.run(a)

This works with the CPU TensorFlow installed through pip, but does not work with the GPU version. After the following output:

I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:83:00.0
Total memory: 11.25GiB
Free memory: 11.16GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 1 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:84:00.0
Total memory: 11.25GiB
Free memory: 11.16GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 1
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y Y
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 1: Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:83:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:84:00.0)
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]

the program freezes before exiting and I have to hard-reboot the instance.

I again checked the driver version 352.39, CUDA version 7.5.18 and cudnn 4.0.7

the CUDA samples all still work
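
For reference, a quick sketch of how those versions can be checked from the shell (assuming the toolkit lives under /usr/local/cuda):

nvidia-smi                                             # reports the loaded driver version
nvcc --version                                         # reports the CUDA toolkit release
grep "#define CUDNN" /usr/local/cuda/include/cudnn.h   # cuDNN version macros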

@mrry
Contributor
mrry commented Jun 6, 2016

Sorry - I missed this issue. I don't have a good idea of why this might be happening, so I'm reassigning to @zheng-xq as our best GPU expert.

@girving girving added the triaged label Jun 6, 2016
@netankit
netankit commented Jun 8, 2016

I can reproduce the same issue with the convolutional.py MNIST example running on a g2.2xlarge AWS GPU machine after installing a machine-compiled TensorFlow wheel. The instance freezes after "Initialized!" is printed, right after the tf.Session() is created and all the variables are initialized.

root@ip-10-10-73-64:~# python tensorflow/tensorflow/models/image/mnist/convolutional.py
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:924] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 4.00GiB
Free memory: 3.95GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:837] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Initialized! 

@zheng-xq
Contributor

The first step is to find out where the hang is happening. Could someone help by printing out the call stacks of all threads while the process is hung, at least the unique ones?

@dipanjan06

I am also getting the same problem. Has anyone already solved it? Any suggestions on how to overcome this problem?

ubuntu@ip-172-31-19-3:~/TensorFlowWorkSpace/rnn/wordSequence$ python sample.py -n=200 --prime='Hello '
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:900] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 4.00GiB
Free memory: 3.95GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)

After this, nothing happens.

@zheng-xq
Contributor

@dipanjan06, since we cannot reproduce this problem locally, we need your help to debug it. While the process is hanging, could you use gdb to attach to it and print out the call stacks?

  1. While the process is hanging, find out its process-id.
  2. Run "gdb".
  3. Run "attach <pid>". If this fails saying you don't have permission, follow the instructions, which may include "sudo sysctl -w kernel.yama.ptrace_scope=0".
  4. Once the process is attached, print out the call stacks of all threads with "thread apply all bt".
  5. Upload the entire output to a website such as pastebin.com and paste the URL here.
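
Condensed into a shell session, the steps above look roughly like this (the script name is just an example; substitute whatever process is hanging):

# 1. find the process id of the hanging program
pgrep -f cifar10_train.py

# 2./3. attach gdb (may require relaxing ptrace restrictions first)
sudo sysctl -w kernel.yama.ptrace_scope=0
gdb -p <pid>

# 4./5. inside gdb, dump the call stacks of all threads to a file for upload
(gdb) set logging file gdb-backtraces.txt
(gdb) set logging on
(gdb) thread apply all bt
(gdb) set logging off
(gdb) detach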

@ChrisHowlin

This is a stack trace of my AWS instance, which is configured according to these instructions:

http://eatcodeplay.com/installing-gpu-enabled-tensorflow-with-python-3-4-in-ec2/

but with the following changes: bazel 0.3.0 (instead of 0.1.8) and tensorflow 0.9.0 (instead of a specific commit).

A problem similar to the above occurs when running cifar10_multi_gpu_train.py.

gdb output: http://pastebin.com/xd7Z2eWZ

What I notice is high CPU usage on ksoftirqd, and checking /proc/interrupts shows a very high number of xen-pirq-msi interrupts on the nvidia device:

/proc/interrupts: http://pastebin.com/m47JVVtR

At some point after running, the OS becomes unresponsive to other input. Although I have found it very hard to get a reproducible TF GPU setup on AWS, I did get this configuration running a couple of weeks ago. After some time working on this, I am fairly sure I have reproduced my initial configuration, and all I see now is freezes. I almost wonder if something has changed under the hood in AWS?
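
For anyone wanting to watch that symptom live, a rough sketch of the checks described above (standard Linux tooling, nothing TensorFlow-specific):

# interrupt counts for the nvidia device, refreshed every second
watch -n 1 'grep -i nvidia /proc/interrupts'

# check whether ksoftirqd threads are burning CPU
top -b -n 1 | grep ksoftirqd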

@dipanjan06

Hello ,

As requested, please find the gdb stack trace.

http://pastebin.com/Trf15Tca

Regards,
Dipanjan


@zheng-xq
Contributor

@ChrisHowlin, one thing I found in your log is "cuMemAlloc_v2"; I wonder whether it consistently shows up in the hang. Could you repeat the process and take multiple snapshots of the hanging call stacks?

@dipanjan06, thank you for the stack trace. However, it is not complete: the last line is "---Type <return> to continue, or q <return> to quit---". Please press <return> to get the full stack printout. It is also worth repeating the process and taking multiple snapshots of the hanging call stacks, so we can find what is common among them, which is likely to be the culprit.

@dipanjan06

Hello,

As advised, I have taken multiple snapshots of the call stack when the actual process hangs after printing the following:

I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:900] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 4.00GiB
Free memory: 3.95GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)

Stack trace:
http://pastebin.com/D5j4uKra


@zheng-xq
Contributor

@ChrisHowlin, @dipanjan06, thanks for the stack traces, they are very useful. I suspect this is a bug in the CUDA driver.

  1. What is the CUDA driver version shown in your nvidia-smi?
  2. Could you upgrade to the latest CUDA driver and see if the problem goes away?
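
A minimal sketch for answering question 1 from the command line (plain nvidia-smi also shows the driver version in its header):

# report the installed driver version for each GPU
nvidia-smi --query-gpu=name,driver_version --format=csv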

Details:

I found that in both stack traces, two CUDA calls are still active while all other threads are effectively sleeping. The first is in a thread that is starting up, and the second is in a thread that is allocating memory; in one case it is CUDA device memory, and in the other it is CUDA host memory. It seems that those two calls are deadlocking with each other, so the problem might be in the CUDA driver. Please try the latest driver. I'll also add the NVIDIA team to this discussion.

Thread 14 (Thread 0x7ff6ea7fc700 (LWP 6730)):
#0 0x00007ff763a89fdd in poll () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007ff749ecac1b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#4 0x00007ff764477184 in start_thread (arg=0x7ff6ea7fc700) at pthread_create.c:312
#5 0x00007ff763a9737d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7ff764d4a740 (LWP 6687)):
#0 0x00007ff763a8e1e7 in ioctl () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007ff7497aedea in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#13 0x00007ff74978fa6d in cuMemAlloc_v2 () from /usr/lib/x86_64-linux-gnu/libcuda.so
#14 0x00007ff75858f711 in perftools::gputools::cuda::CUDADriver::DeviceAllocate(perftools::gputools::cuda::CudaContext*, unsigned long long) ()
from /home/ubuntu/anaconda3/lib/python3.4/site-packages/tensorflow/python/_pywrap_tensorflow.so

@zheng-xq
Contributor

Adding @benbarsdell from NVIDIA to this discussion to look at the GRID CUDA driver issue.

@aselle aselle removed the triaged label Jul 28, 2016
@alextp
Contributor
alextp commented Aug 15, 2016

@benbarsdell @zheng-xq friendly ping?

@mbektimirov
mbektimirov commented Aug 30, 2016

I'm experiencing the same problem as well: Ubuntu 14.04 on AWS g2.2xlarge, no reaction to kill -9 or Ctrl-C. MNIST log (nothing special):

$ python tensorflow/tensorflow/models/image/mnist/convolutional.py
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz

Edit:
Using cuda_7.5.18_linux.run with its drivers.

@mbektimirov

Solved this problem by installing a new driver from the Nvidia site. Just download the latest driver for your card. The solution is from here: https://groups.google.com/d/msg/torch7/kLusyLEj4oc/MLRvcCy_FAAJ

wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda_7.5.18_linux.run

Then I installed everything except the samples and the driver:

chmod o+x cuda_7.5.18_linux.run
sudo ./cuda_7.5.18_linux.run

Then I downloaded and installed the new driver from http://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/XFree86/Linux-x86_64/361.28/NVIDIA-Linux-x86_64-361.28.run&lang=us&type=GeForce :

wget http://us.download.nvidia.com/XFree86/Linux-x86_64/361.28/NVIDIA-Linux-x86_64-361.28.run
chmod o+x NVIDIA-Linux-x86_64-361.28.run
sudo ./NVIDIA-Linux-x86_64-361.28.run

@max0x
max0x commented Sep 10, 2016

@mbektimirov Your solution works for me. I upgraded my Nvidia driver from version 352.39 to 367.44. Now I can run TF freely without GPU freezing. Thank you!

@dojoteef

@benbarsdell @zheng-xq I am running into a similar problem.

My configuration is:

  • Ubuntu 16.04.1 LTS
  • Two Nvidia Titan X (Pascal) GPUs
  • Nvidia driver 367.57
  • Cuda 8.0.44
  • Cudnn 5.1.5

I have previously tried with:

  • Nvidia driver 367.44
  • Cuda 8.0.27 (and 8.0.27.1 patch)

I have built tensorflow from source. Here are the scripts I used to configure and [install](https://github.com/dojoteef/dotfiles/blob/a7d2321/bin/install_tensorflow) tensorflow. The output of __git_version__ is below:

python -c 'import tensorflow as tf; print(tf.__git_version__)'
I tensorflow/stream_executor/dso_loader.cc:116] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:116] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:116] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:116] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:116] successfully opened CUDA library libcurand.so.8.0 locally
v0.11.0rc0-787-gf6e0f64

Here are a few pastebins of gdb output from when the threads are stalled while running cifar10_multi_gpu_train.py --num_gpus=2:

  • Hang after 3410 steps: dump1, dump2 (after waiting a bit)
  • Hang after 2910 steps: dump3

I also received a core dump running cifar10_train.py (I didn't attach the core dump as it's nearly 300MB, though I can if it would be helpful):

...
2016-10-14 19:09:01.337748: step 1490, loss = 2.06 (1690.8 examples/sec; 0.076 sec/batch)
E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED
F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:198] Unexpected Event status: 1
Aborted (core dumped)

I am not sure if the issues are related or not, but it seems likely that they are.

If there is anything you need me to do in order to help figure out or address the issues, I am happy to do it. Not being able to use my GPUs is unfortunately really slowing down my ability to make progress. Thanks!

@dojoteef
dojoteef commented Nov 3, 2016

Just a follow-up on this. I finally tracked this down to a faulty motherboard; sorry for the false alarm. In case anyone else runs into an error case similar to mine and happens to be using an ASRock X99 WS-E motherboard, it seems to be a common quality-control problem they have.

@gunan gunan closed this as completed Dec 27, 2016
@gunan
Contributor
gunan commented Dec 27, 2016

Looks like on AWS a driver update resolves the problem.
All reports seem to be resolved at the moment.

@PiotrDabkowski
PiotrDabkowski commented Jan 22, 2017

@dojoteef I have the exact same problem: random freezes or CUDA_ERROR_LAUNCH_FAILED errors with core dumps that are preceded by the following kernel logs:

Jan 22 15:43:51 XXX kernel: [ 8793.084341] NVRM: GPU at PCI:0000:09:00: GPU-XXXXXXX-XXXX-XXXX-XXXX-XXXXXXXX
Jan 22 15:43:51 XXX kernel: [ 8793.084354] NVRM: GPU Board Serial Number: XXXXXXXXXXXX
Jan 22 15:43:51 XXX kernel: [ 8793.084358] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000012 intr 00008000
Jan 22 15:43:51 XXX kernel: [ 8793.084518] NVRM: Xid (PCI:0000:09:00): 32, Channel ID 00000012 intr 00008000

These errors happen randomly, and the expected time to failure is about 2 hours, so training larger models is a nightmare.
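
One way to watch for these Xid reports while a job is running (a sketch using standard kernel-log tools; nothing here is TensorFlow-specific):

# scan the kernel log for NVRM/Xid messages after a crash
dmesg | grep -i -E 'NVRM|Xid'

# or follow them live on systemd-based systems
journalctl -k -f | grep -i --line-buffered Xid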

I have the exact same setup as you, with 2 Titan XPs and the same driver version, but I own an ASUS X99-E WS motherboard. How did you resolve your motherboard issues? Does a newer Nvidia driver or BIOS version solve the problem?

@dojoteef

@PiotrDabkowski I believe the issue could be the same despite the different motherboard manufacturer. This review on Newegg seems to indicate that the PLX chips used for the PCIe bus are to blame. I believe your motherboard also uses those chips.

I fixed my issue by replacing my motherboard.

@PiotrDabkowski
PiotrDabkowski commented Jan 22, 2017

@dojoteef Thank you for your answer. Error 32 suggests a bad PLX chip. To confirm it, I have been running the gpu_burn test for half an hour now without errors. I will leave it running for 12 hours to make sure :)

EDIT: Crashed after 2 hours of gpu_burn. Not a TensorFlow issue, bad motherboard. Shame on you ASUS.
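
For anyone who wants to run the same stress test, a rough sketch (this assumes the commonly used gpu-burn tool from github.com/wilicc/gpu-burn; the argument is the run time in seconds):

git clone https://github.com/wilicc/gpu-burn
cd gpu-burn
make              # needs nvcc on the PATH
./gpu_burn 7200   # stress the GPUs for 2 hours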

@XuefeiW
XuefeiW commented Feb 15, 2017

I had the same problem before. It turns out that I had installed both CUDA 7.0 and CUDA 8.0, with Nvidia driver 367.57, on Ubuntu 14.04. The issue happens when I forget to point LD_LIBRARY_PATH at CUDA 8.0.
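
In other words, when more than one CUDA toolkit is installed, something like the following needs to be in ~/.bashrc so the 8.0 libraries are the ones that get loaded (paths assume the default /usr/local/cuda-8.0 install location):

export CUDA_HOME=/usr/local/cuda-8.0
export LD_LIBRARY_PATH="/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH"
export PATH="/usr/local/cuda-8.0/bin:$PATH"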

@jwjohnson314

Related to this is an issue I posted about on Stack Overflow which turned out to be a hardware problem. The program trained for a while, then hung, then crashed, and it appears to be a PCIe issue related to message signaled interrupts (more info here and here). Setting the kernel parameter pci=nommconf fixed the issue for me.
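
On Ubuntu, that kernel parameter can be added through GRUB; a sketch (keep any existing options in the quoted string):

# in /etc/default/grub, append pci=nommconf to GRUB_CMDLINE_LINUX_DEFAULT, e.g.
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=nommconf"
sudo nano /etc/default/grub
sudo update-grub
sudo reboot

# after rebooting, confirm the parameter took effect
cat /proc/cmdline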

@D0048
D0048 commented Oct 16, 2017

Experienced the same here. This seems to happen only the first time I run TensorFlow. The program responds to SIGTERM about 10 minutes later, and subsequent runs work just fine.

@KSeangTan

Hi, everyone. It might be very late to join this discussion. Besides updating the BIOS if you are using an X99-E WS/USB 3.1 and updating the graphics driver, you might also refer to this post if you are using Windows 10.

@keishatsai

I have encountered the same problem here.
My settings:
Windows 10
CUDA 10
cuDNN v7
TensorFlow 1.14
6GB memory
It just gets stuck after creating the TensorFlow device.
