Pipeline using "genia" throws CUDA error when multiprocessed · Issue #615 · stanfordnlp/stanza

Open
ppfeiff opened this issue Feb 5, 2021 · 19 comments
Labels: bug
@ppfeiff commented Feb 5, 2021

Describe the bug
Hi, for tokenizing a very large database of ~20M biomedical texts, we tried to parallelize the tokenization with GPU support and multiprocessing. The same code as in #552 was used; however, when using the "genia" package, CUDA throws a runtime error when initializing the pipeline:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

As it works with other languages/packages, I guess this can be fixed to enable multiprocessing with GPU support using genia. Without GPU support, the code runs about 10x slower.

To Reproduce
Steps to reproduce the behavior:

  1. Initialize the pipeline: stanza.Pipeline(lang='en', package='genia', processors='tokenize', use_gpu=True, logging_level='WARN')
  2. Write a wrapper function that initializes the pipeline and calls the tokenizer.
  3. Call it in a Pool using pool.apply_async(function, dataset) (see the sketch after this list).
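
The error message itself names the workaround; here is a minimal sketch of it, assuming the 'spawn' start method and a hypothetical tokenize_chunk worker (not code from this issue):

```python
# Minimal sketch: use the 'spawn' start method so each worker process
# initializes its own CUDA context instead of inheriting a forked one.
import multiprocessing as mp
import stanza

def tokenize_chunk(texts):
    # Hypothetical worker: each process builds its own pipeline.
    nlp = stanza.Pipeline(lang='en', package='genia', processors='tokenize',
                          use_gpu=True, logging_level='WARN')
    return [nlp(t) for t in texts]

if __name__ == '__main__':
    chunks = [["This is a Text"] * 10 for _ in range(4)]
    ctx = mp.get_context('spawn')  # avoids the fork/CUDA re-init error
    with ctx.Pool(processes=2) as pool:
        results = pool.map(tokenize_chunk, chunks)
```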

Environment:

  • OS: Ubuntu 18.04
  • Python version: Python 3.6.9 and venv
  • Stanza version: 1.2
  • CUDA: 10.2
ppfeiff added the bug label on Feb 5, 2021
@AngledLuffa (Collaborator) commented Feb 6, 2021

Can you maybe help us skip steps 2 & 3 here?

@ppfeiff (Author) commented Feb 8, 2021

@AngledLuffa sorry, I don't get what you mean. If I skip steps 2 & 3, everything works. If you need an example of how to run the pipeline in parallel, see #552.

@AngledLuffa (Collaborator) commented Feb 8, 2021 via email

@ppfeiff (Author) commented Feb 8, 2021

I created a script to reproduce the error:

```python
import pandas as pd
import numpy as np
import stanza
import torch

default_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

data = pd.DataFrame({"id": np.arange(1000000),
                     "text": ["This is a Text"] * 1000000})


class tokenize_lemmatize():
    # Class for tokenizing, lemmatizing, cleaning...

    def __init__(self, gpu):
        self.nlp = stanza.Pipeline(lang='en', package='genia', processors='tokenize',
                                   use_gpu=gpu, logging_level='WARN')


def tokenize_df(df, use_gpu):
    # Build the pipeline wrapper inside this worker.
    tokenizer = tokenize_lemmatize(use_gpu)

    print(df)
    for idx, row in df.iterrows():
        df.at[idx, "tokenized"] = tokenizer.nlp(row["text"])


def tokenize_df_parallel(df):
    from multiprocessing import Pool
    import multiprocessing

    n_cores = min(int(multiprocessing.cpu_count() / 2), 8)
    print("Multithreading splits tokenization on", n_cores, "threads")

    df_splits = np.array_split(df, n_cores)
    pool = Pool(n_cores)
    threads = []
    for split in df_splits:
        threads.append(pool.apply_async(tokenize_df, (split, True)))

    results = [t.get() for t in threads]
    print(results)


pd.set_option('display.max_columns', 3)
pd.set_option('display.width', 5000)

print(data)
tokenize_df_parallel(data)
```

However, I noticed that the error is only caused if you call
default_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
(this was originally called in another script that was imported here).
If you comment out that line, everything works fine.
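
A sketch of the deferral that avoids this, assuming the device is only needed after the workers exist (get_default_device is a hypothetical helper, not part of the repro script):

```python
import torch

def get_default_device():
    # Query CUDA lazily, inside the process that needs it, rather than at
    # import time in the parent: a module-level call can initialize the
    # CUDA context before the fork and break the child processes.
    return torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
```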

@ppfeiff (Author) commented Feb 8, 2021

Sorry, the style is weird. Here is the file: stanza_mp_bug.zip

@AngledLuffa (Collaborator) commented

According to this, CUDA operations in torch cannot be used with fork-based multiprocessing:

https://pytorch.org/docs/master/notes/multiprocessing.html

Perhaps try the multiprocessing mechanism described there?
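
A minimal sketch of that mechanism, using torch.multiprocessing with the 'spawn' start method (the worker here is a placeholder, not Stanza code):

```python
# Sketch per the linked PyTorch notes: torch.multiprocessing is a drop-in
# replacement for the standard multiprocessing module.
import torch.multiprocessing as mp

def worker(rank):
    # Placeholder worker; a real one would build its pipeline here.
    print("worker", rank, "started")

if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)  # each child gets a fresh CUDA context
    processes = [mp.Process(target=worker, args=(i,)) for i in range(2)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```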

@ppfeiff (Author) commented Feb 9, 2021

Thanks for your suggestion. I moved the line
default_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
to a position in the code where it is called after stanza has finished. This way, no error is thrown.

The bigger issue is speed. No matter whether I use the Python or the PyTorch implementation of multiprocessing, the code runs slower than on a single core. The more processes I use (at most core_count/2), the longer it takes (~20-70% longer). The GPU is never at its maximum and most of the CPU cores are idle, yet it still runs much slower, which is really frustrating. I don't know what the problem is; maybe it is moving the data onto the GPU and back...
Can you suggest a way to make processing faster in such a scenario? I already tried batching, but I face the same problems as described in #309. Currently, tokenizing our whole dataset would take a whole week (on a high-end server).

@AngledLuffa (Collaborator) commented Feb 9, 2021 via email

@ppfeiff (Author) commented Feb 9, 2021

OK, I'm going to have a look at it.

Our data is stored internally as a pandas DataFrame, for instance:

```python
data = pd.DataFrame({"id": np.arange(1000000),
                     "text": ["This is a Text"] * 1000000})
```

Obviously, not every row has the same text. We need to tokenize each row, and the tokenized result of each text is stored in another column.

@AngledLuffa (Collaborator) commented Feb 9, 2021 via email

@ppfeiff (Author) commented Feb 10, 2021

From the description, that sounds like exactly what we are looking for.

Could you give an example of how to use the bulk_process mechanism?

@AngledLuffa (Collaborator) commented Feb 10, 2021 via email

@ppfeiff (Author) commented Feb 10, 2021

The command you provided does not work for me:
Could not find a tag or branch '5d2d39ec822c65cb5f60d547357ad8b821683e3c', assuming commit.

@ppfeiff (Author) commented Feb 10, 2021

OK, I installed the dev branch, which worked.

However, I cannot get it running. If I understand correctly, when I pass a list of Documents to the Pipeline, it automatically calls the bulk_process function, so I need to create Documents from the pandas DataFrame. The documentation on this does not help a lot: https://stanfordnlp.github.io/stanza/data_conversion.html.
As a Document consists of Sentences, I tried to create a list of Sentences from the DataFrame using

```python
from stanza.models.common.doc import Document, Sentence

for id, row in data.iterrows():
    sen = Sentence(row["text"])
```

which gives me an error in

```
stanza/models/common/doc.py", line 351, in _process_tokens
    entry[ID] = (i+1, )
TypeError: 'str' object does not support item assignment
```
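
For reference, the multi-document pattern documented by Stanza is to wrap each raw text in a stanza.Document with empty annotations rather than constructing Sentence objects directly; a sketch, assuming a build with bulk processing support:

```python
# Sketch of the documented multi-document pattern: wrap each raw text in a
# stanza.Document, then pass the whole list to the pipeline at once.
import stanza

nlp = stanza.Pipeline(lang='en', package='genia', processors='tokenize',
                      use_gpu=True, logging_level='WARN')

texts = ["This is a Text", "Another biomedical sentence."]
in_docs = [stanza.Document([], text=t) for t in texts]  # empty annotations, raw text
out_docs = nlp(in_docs)  # a list input is processed in bulk
for doc in out_docs:
    print([token.text for sentence in doc.sentences for token in sentence.tokens])
```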

@AngledLuffa (Collaborator) commented Feb 10, 2021 via email

@ppfeiff (Author) commented Feb 11, 2021

It works very well. I did a quick speed comparison, and the bulk_process mechanism is 3-4 times faster than processing the entries one after another (including the time for converting the texts to Documents and the results back to a list of texts; most of the time is spent on tokenizing, though).

There seems to be a sweet spot at batches of ~1,000 documents; see the sketch after this comment. Do you think multiprocessing the bulk_process mechanism can give additional speed improvements? Because multiprocessing the standard tokenizer makes it slower...
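
A sketch of that batching, assuming the DataFrame from earlier and the observed ~1,000-document sweet spot (the "tokenized" column follows the repro script; sizes shrunk for illustration):

```python
import numpy as np
import pandas as pd
import stanza

nlp = stanza.Pipeline(lang='en', package='genia', processors='tokenize',
                      use_gpu=True, logging_level='WARN')

data = pd.DataFrame({"id": np.arange(10000),
                     "text": ["This is a Text"] * 10000})
data["tokenized"] = None  # object column to hold the Documents

batch_size = 1000  # the sweet spot observed above
for start in range(0, len(data), batch_size):
    batch = data.iloc[start:start + batch_size]
    in_docs = [stanza.Document([], text=t) for t in batch["text"]]
    out_docs = nlp(in_docs)  # one bulk call per batch
    for idx, doc in zip(batch.index, out_docs):
        data.at[idx, "tokenized"] = doc
```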

@AngledLuffa (Collaborator) commented Feb 11, 2021 via email

@ppfeiff (Author) commented Feb 12, 2021

I just saw that you do not use the PyTorch DataLoader to feed the text to the model. Is there a specific reason why you use your own implementation of a data loader? (I guess it's this one: https://github.com/stanfordnlp/stanza/blob/master/stanza/models/tokenization/data.py.) PyTorch DataLoaders can be much faster, since they are multiprocessed; using a non-PyTorch data loader might be the major reason why it is so slow.
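
For illustration, this is the kind of multiprocessed loader meant here; a generic PyTorch sketch with a hypothetical TextDataset, not Stanza's actual internals:

```python
# Generic sketch of a multiprocessed PyTorch DataLoader. TextDataset is a
# hypothetical stand-in for Stanza's tokenization.data loader.
from torch.utils.data import DataLoader, Dataset

class TextDataset(Dataset):
    def __init__(self, texts):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Per-item preprocessing would run here, inside a worker process.
        return self.texts[idx]

if __name__ == '__main__':
    loader = DataLoader(TextDataset(["This is a Text"] * 1000),
                        batch_size=32, num_workers=4)
    for batch in loader:
        pass  # batches arrive already prepared by the workers
```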

@AngledLuffa (Collaborator) commented Feb 12, 2021 via email
