Pipeline using "genia" throws CUDA error when multiprocessed · Issue #615 · stanfordnlp/stanza

Open
ppfeiff opened this issue Feb 5, 2021 · 19 comments
Labels: bug
@ppfeiff commented Feb 5, 2021

Describe the bug
Hi, for tokenizing a very large database of ~20M biomedical texts, we tried to parallelize the tokenization with GPU support and multiprocessing. The same code as in #552 was used; however, when using the "genia" package, CUDA throws a runtime error when initializing the pipeline:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

As it works with other languages/packages, I guess this can be fixed to enable multiprocessing with GPU support using genia. Without GPU support, the code runs about 10x slower.

To Reproduce
Steps to reproduce the behavior:

  1. Initialize the pipeline: stanza.Pipeline(lang='en', package='genia', processors='tokenize', use_gpu=True, logging_level='WARN')
  2. Write a wrapper function that initializes the pipeline and calls the tokenizer.
  3. Call it in a Pool using pool.apply_async(function, dataset) (see the sketch after this list).
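
The error message itself names the workaround; here is a minimal sketch of it, assuming the 'spawn' start method and a hypothetical tokenize_chunk worker (not code from this issue):

```python
# Minimal sketch: use the 'spawn' start method so each worker process
# initializes its own CUDA context instead of inheriting a forked one.
import multiprocessing as mp
import stanza

def tokenize_chunk(texts):
    # Hypothetical worker: each process builds its own pipeline.
    nlp = stanza.Pipeline(lang='en', package='genia', processors='tokenize',
                          use_gpu=True, logging_level='WARN')
    return [nlp(t) for t in texts]

if __name__ == '__main__':
    chunks = [["This is a Text"] * 10 for _ in range(4)]
    ctx = mp.get_context('spawn')  # avoids the fork/CUDA re-init error
    with ctx.Pool(processes=2) as pool:
        results = pool.map(tokenize_chunk, chunks)
```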

Environment:

  • OS: Ubuntu 18.04
  • Python version: Python 3.6.9 and venv
  • Stanza version: 1.2
  • CUDA: 10.2
ppfeiff added the bug label on Feb 5, 2021
@AngledLuffa (Collaborator) commented Feb 6, 2021

Can you maybe help us skip steps 2 & 3 here?

@ppfeiff (Author) commented Feb 8, 2021

@AngledLuffa sorry, I don't get what you mean. If I skip steps 2 & 3, everything works. If you need an example of how to run the pipeline in parallel, see #552.

@AngledLuffa (Collaborator) commented Feb 8, 2021 via email

@ppfeiff (Author) commented Feb 8, 2021

I created a script to reproduce the error:

```python
import pandas as pd
import numpy as np
import stanza
import torch

default_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

data = pd.DataFrame({"id": np.arange(1000000),
                     "text": ["This is a Text"] * 1000000})


class tokenize_lemmatize():
    # Class for tokenizing, lemmatizing, cleaning...

    def __init__(self, gpu):
        self.nlp = stanza.Pipeline(lang='en', package='genia', processors='tokenize',
                                   use_gpu=gpu, logging_level='WARN')


def tokenize_df(df, use_gpu):
    # Build the pipeline wrapper inside this worker.
    tokenizer = tokenize_lemmatize(use_gpu)

    print(df)
    for idx, row in df.iterrows():
        df.at[idx, "tokenized"] = tokenizer.nlp(row["text"])


def tokenize_df_parallel(df):
    from multiprocessing import Pool
    import multiprocessing

    n_cores = min(int(multiprocessing.cpu_count() / 2), 8)
    print("Multithreading splits tokenization on", n_cores, "threads")

    df_splits = np.array_split(df, n_cores)
    pool = Pool(n_cores)
    threads = []
    for split in df_splits:
        threads.append(pool.apply_async(tokenize_df, (split, True)))

    results = [t.get() for t in threads]
    print(results)


pd.set_option('display.max_columns', 3)
pd.set_option('display.width', 5000)

print(data)
tokenize_df_parallel(data)
```

However, I noticed that the error is only caused if you call
default_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
(this was originally called in another script that was imported here).
If you comment out that line, everything works fine.
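
A sketch of the deferral that avoids this, assuming the device is only needed after the workers exist (get_default_device is a hypothetical helper, not part of the repro script):

```python
import torch

def get_default_device():
    # Query CUDA lazily, inside the process that needs it, rather than at
    # import time in the parent: a module-level call can initialize the
    # CUDA context before the fork and break the child processes.
    return torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
```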

@ppfeiff (Author) commented Feb 8, 2021

Sorry, the style is weird. Here is the file: stanza_mp_bug.zip

@AngledLuffa (Collaborator) commented

According to this, CUDA operations in torch cannot be used with fork-based multiprocessing:

https://pytorch.org/docs/master/notes/multiprocessing.html

Perhaps try the multiprocessing mechanism described there?
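
A minimal sketch of that mechanism, using torch.multiprocessing with the 'spawn' start method (the worker here is a placeholder, not Stanza code):

```python
# Sketch per the linked PyTorch notes: torch.multiprocessing is a drop-in
# replacement for the standard multiprocessing module.
import torch.multiprocessing as mp

def worker(rank):
    # Placeholder worker; a real one would build its pipeline here.
    print("worker", rank, "started")

if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)  # each child gets a fresh CUDA context
    processes = [mp.Process(target=worker, args=(i,)) for i in range(2)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```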

@ppfeiff (Author) commented Feb 9, 2021

Thanks for your suggestion. I moved the line
default_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
to a position in the code where it is called after stanza has finished. This way, no error is thrown.

The bigger issue is speed. No matter whether I use the Python or the PyTorch implementation of multiprocessing, the code runs slower than on a single core. The more processes I use (at most core_count/2), the longer it takes (~20-70% longer). The GPU is never at its maximum and most of the CPU cores are idle, yet it still runs much slower, which is really frustrating. I don't know what the problem is; maybe it is moving the data onto the GPU and back...
Can you suggest a way to make processing faster in such a scenario? I already tried batching, but I face the same problems as described in #309. Currently, tokenizing our whole dataset would take a whole week (on a high-end server).

@AngledLuffa (Collaborator) commented Feb 9, 2021 via email

@ppfeiff (Author) commented Feb 9, 2021

OK, I'm going to have a look at it.

Our data is stored internally as a pandas DataFrame, for instance:

```python
data = pd.DataFrame({"id": np.arange(1000000),
                     "text": ["This is a Text"] * 1000000})
```

Obviously, not every row has the same text. We need to tokenize each row, and the tokenized result of each text is stored in another column.

@AngledLuffa (Collaborator) commented Feb 9, 2021 via email

@ppfeiff (Author) commented Feb 10, 2021

From the description, that sounds like exactly what we are looking for.

Could you give an example of how to use the bulk_process mechanism?

@AngledLuffa (Collaborator) commented Feb 10, 2021 via email

@ppfeiff (Author) commented Feb 10, 2021

The command you provided does not work for me:
Could not find a tag or branch '5d2d39ec822c65cb5f60d547357ad8b821683e3c', assuming commit.

@ppfeiff (Author) commented Feb 10, 2021

OK, I installed the dev branch, which worked.

However, I cannot get it running. If I understand correctly, when I pass a list of Documents to the Pipeline, it automatically calls the bulk_process function, so I need to create Documents from the pandas DataFrame. The documentation on this does not help a lot: https://stanfordnlp.github.io/stanza/data_conversion.html.
As a Document consists of Sentences, I tried to create a list of Sentences from the DataFrame using

```python
from stanza.models.common.doc import Document, Sentence

for id, row in data.iterrows():
    sen = Sentence(row["text"])
```

which gives me an error in

```
stanza/models/common/doc.py", line 351, in _process_tokens
    entry[ID] = (i+1, )
TypeError: 'str' object does not support item assignment
```
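
For reference, the multi-document pattern documented by Stanza is to wrap each raw text in a stanza.Document with empty annotations rather than constructing Sentence objects directly; a sketch, assuming a build with bulk processing support:

```python
# Sketch of the documented multi-document pattern: wrap each raw text in a
# stanza.Document, then pass the whole list to the pipeline at once.
import stanza

nlp = stanza.Pipeline(lang='en', package='genia', processors='tokenize',
                      use_gpu=True, logging_level='WARN')

texts = ["This is a Text", "Another biomedical sentence."]
in_docs = [stanza.Document([], text=t) for t in texts]  # empty annotations, raw text
out_docs = nlp(in_docs)  # a list input is processed in bulk
for doc in out_docs:
    print([token.text for sentence in doc.sentences for token in sentence.tokens])
```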

@AngledLuffa (Collaborator) commented Feb 10, 2021 via email

@ppfeiff (Author) commented Feb 11, 2021

It works very well. I did a quick speed comparison, and the bulk_process mechanism is 3-4 times faster than processing the entries one after another (including the time for converting the texts to Documents and the results back to a list of texts; most of the time is spent on tokenizing, though).

There seems to be a sweet spot at batches of ~1,000 documents; see the sketch after this comment. Do you think multiprocessing the bulk_process mechanism can give additional speed improvements? Because multiprocessing the standard tokenizer makes it slower...
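
A sketch of that batching, assuming the DataFrame from earlier and the observed ~1,000-document sweet spot (the "tokenized" column follows the repro script; sizes shrunk for illustration):

```python
import numpy as np
import pandas as pd
import stanza

nlp = stanza.Pipeline(lang='en', package='genia', processors='tokenize',
                      use_gpu=True, logging_level='WARN')

data = pd.DataFrame({"id": np.arange(10000),
                     "text": ["This is a Text"] * 10000})
data["tokenized"] = None  # object column to hold the Documents

batch_size = 1000  # the sweet spot observed above
for start in range(0, len(data), batch_size):
    batch = data.iloc[start:start + batch_size]
    in_docs = [stanza.Document([], text=t) for t in batch["text"]]
    out_docs = nlp(in_docs)  # one bulk call per batch
    for idx, doc in zip(batch.index, out_docs):
        data.at[idx, "tokenized"] = doc
```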

@AngledLuffa (Collaborator) commented Feb 11, 2021 via email

@ppfeiff (Author) commented Feb 12, 2021

I just saw that you do not use the PyTorch DataLoader to feed the text to the model. Is there a specific reason why you use your own implementation of a data loader? (I guess it's this one: https://github.com/stanfordnlp/stanza/blob/master/stanza/models/tokenization/data.py.) PyTorch DataLoaders can be much faster, since they are multiprocessed; using a non-PyTorch data loader might be the major reason why it is so slow.
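
For illustration, this is the kind of multiprocessed loader meant here; a generic PyTorch sketch with a hypothetical TextDataset, not Stanza's actual internals:

```python
# Generic sketch of a multiprocessed PyTorch DataLoader. TextDataset is a
# hypothetical stand-in for Stanza's tokenization.data loader.
from torch.utils.data import DataLoader, Dataset

class TextDataset(Dataset):
    def __init__(self, texts):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Per-item preprocessing would run here, inside a worker process.
        return self.texts[idx]

if __name__ == '__main__':
    loader = DataLoader(TextDataset(["This is a Text"] * 1000),
                        batch_size=32, num_workers=4)
    for batch in loader:
        pass  # batches arrive already prepared by the workers
```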

@AngledLuffa (Collaborator) commented Feb 12, 2021 via email
