multiprocess dataset reader and iterator #1760
Conversation
Looks great 👍
for _ in range(self.num_workers):
    input_queue.put(None)

# TODO(joelgrus): where does this number come from?
We could allow this to be configurable, and may need to increase it depending on how many instances the iterator uses for each batch. We want it to be large enough to give a significant buffer, so the iterator doesn't have to wait for instances, but not so large that it starts taking up significant memory.
I can do this
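For illustration, a minimal sketch of what making this configurable could look like, assuming a constructor argument named output_queue_size (the parameter name and default value here are assumptions, not the PR's actual signature):

from multiprocessing import Queue

# Sketch only: output_queue_size is an assumed constructor argument, not
# necessarily the name this PR uses.
class ConfigurableQueueSizeSketch:
    def __init__(self, num_workers: int = 1, output_queue_size: int = 1000) -> None:
        self.num_workers = num_workers
        # Larger values buffer more batches so the trainer rarely has to wait;
        # smaller values keep fewer tensor dicts in memory at once.
        self.output_queue = Queue(output_queue_size)

With a default in place, anyone hitting memory pressure or iterator starvation can tune the buffer from configuration rather than editing code.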
""" | ||
instances: List[Instance] = [] | ||
|
||
def make_batches() -> None: |
Is there any reason we can't re-use an existing iterator here? For many applications we'd want this to have more logic to do something like the bucket iterator, or optionally allow shuffling all of the instances read into memory before creating batches, etc. Right now we will always read each file in the same order.
yes, this is a good idea
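A rough sketch of the re-use idea, assuming the wrapped iterator is passed in as base_iterator (an assumed parameter name): batching, bucketing, and shuffling stay in the wrapped iterator, and the multiprocess wrapper only feeds it instances.

from typing import Iterable, Iterator

class IteratorReuseSketch:
    """Sketch only: base_iterator is an assumed parameter name."""

    def __init__(self, base_iterator) -> None:
        # The wrapped iterator (e.g. a bucket iterator) owns batch creation,
        # sorting, and shuffling; the multiprocess wrapper only feeds it
        # instances pulled off the queue.
        self.base_iterator = base_iterator

    def _iterate(self, instances: Iterable) -> Iterator:
        yield from self.base_iterator(instances, num_epochs=1, shuffle=False)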
if len(instances) >= max_instances_in_memory:
    make_batches()

def _queuer(instances: Iterable[Instance],
Is there a way to remove the _queuer if we are also using the MultiprocessDatasetReader, by passing the reader queue directly to the MultiprocessIterator? E.g. the MultiprocessIterator can iterate over either an iterable of instances or a queue filled with instances?
this is the one that's going to be hard. the problem is that the trainer does

train_generator = self._iterator(self._train_data,
                                 num_epochs=1,
                                 shuffle=self._shuffle)

and so I'd have to add a weird coupling where the iterator gets the queue directly. let me think about it a little bit
shuffle: bool = True) -> Iterator[TensorDict]:

# If you run it forever, the multiprocesses won't shut down correctly.
# TODO(joelgrus) find a solution for this
I'm partial to pkill ...
if num_epochs is None:
    raise ConfigurationError("Multiprocess Iterator must be run for a fixed number of epochs")

# TODO(joelgrus) are these the right sizes?
See comment above -- seems we could also make the input queue size approximately batch_size times larger than the output queue size.
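As a concrete (purely illustrative) sketch of that sizing rule, with made-up numbers: the input queue holds individual instances while the output queue holds whole batches, so scaling by batch_size keeps the two roughly in balance.

# Illustrative numbers only, not values from the PR.
batch_size = 32
output_queue_size = 1000          # batches buffered for the trainer
input_queue_size = batch_size * output_queue_size  # instances buffered for the iterator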
ok, I addressed all of your comments except for the "read directly from the queue" one. it turns out that that's a tricky distributed systems problem (or at least I couldn't figure out how to make it not one).

the key issue is that you have some number of workers that generate instances and put them on the queue, and some other number of workers that pull them off and generate tensor_dicts. if there is an iterator in the middle, it can recognize when all of the instance-generating workers are done and stop iterating, which tells the tensor-dict-generating workers to stop. without the iterator in the middle, how do the tensor-dict-generating workers know when to stop? I couldn't figure out a clean solution.

however, I made it so that the Iterator[Instance] that the MultiprocessIterator gets exposes the queue, so if I can figure out a good solution for this it shouldn't be hard to implement.
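To make the termination issue concrete, here is a minimal, self-contained sketch (not code from this PR) of the pattern described above: each instance-generating worker puts one sentinel on the queue when it finishes, and the consuming side can only stop once it has counted a sentinel from every producer -- which requires knowing how many producers exist, i.e. exactly the bookkeeping the iterator in the middle currently does.

from multiprocessing import Process, Queue

NUM_PRODUCERS = 2
SENTINEL = None  # one sentinel per producer marks "this worker is done"

def produce(worker_id, queue):
    # stand-in for an instance-generating worker
    for i in range(3):
        queue.put(f"instance-{worker_id}-{i}")
    queue.put(SENTINEL)

def consume(queue):
    # stand-in for a tensor-dict-generating worker: it can only stop once it
    # has counted a sentinel from every producer, which means it has to know
    # how many producers there are
    finished = 0
    while finished < NUM_PRODUCERS:
        item = queue.get()
        if item is SENTINEL:
            finished += 1
        else:
            print("got", item)

if __name__ == "__main__":
    queue = Queue()
    producers = [Process(target=produce, args=(i, queue)) for i in range(NUM_PRODUCERS)]
    for p in producers:
        p.start()
    consume(queue)
    for p in producers:
        p.join()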
Other than the shuffle, it looks great 👍
yield instance
instance = input_queue.get()

for tensor_dict in iterator(instances(), num_epochs=1, shuffle=False):
Shouldn't we be passing down the shuffle parameter here? I think it will ignore whatever shuffle parameter is passed to __call__ without it.
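A sketch of what that suggestion would look like, assuming the shuffle flag received by __call__ is in scope where the loop runs (this is not the committed change):

# Sketch only: thread the shuffle argument through to the wrapped iterator
# instead of hard-coding shuffle=False.
def _call_sketch(iterator, instances, shuffle):
    for tensor_dict in iterator(instances, num_epochs=1, shuffle=shuffle):
        yield tensor_dict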
the DatasetReader is a wrapper for any other DatasetReader; the iterator is its own iterator. (our iterator code is way too complicated to make a similar wrapper possible, and anyway it feels less important for the iterator to work that way)