`OutOfBoundsError` when streaming parquet files with `low_memory=True` · Issue #553 · Lightning-AI/litData · GitHub

OutOfBoundsError when streaming parquet files with low_memory=True #553


Open
kyoungrok0517 opened this issue Apr 13, 2025 · 10 comments
Labels
bug Something isn't working help wanted Extra attention is needed waiting on author Waiting for user input or feedback.

Comments

@kyoungrok0517
kyoungrok0517 commented Apr 13, 2025

🐛 Bug

To Reproduce

I'm trying to stream my parquet dataset with the low_memory=True option, but I encounter an OutOfBoundsError. There is no problem when I set low_memory=False. The parquet files are compressed with zstd.

Error
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/dataset.py", line 382, in __getitem__
[rank6]:     return self.cache[index]
[rank6]:            ~~~~~~~~~~^^^^^^^
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/cache.py", line 145, in __getitem__
[rank6]:     return self._reader.read(index)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/reader.py", line 388, in read
[rank6]:     item = self._item_loader.load_item_from_chunk(
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 619, in load_item_from_chunk
[rank6]:     return self._get_item_with_low_memory(chunk_index, chunk_filepath, relative_index)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 678, in _get_item_with_low_memory
[rank6]:     return row_group_df.row(row_index_within_group, named=True)  # type: ignore
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py", line 10853, in row
[rank6]:     row = self._df.row_tuple(index)
[rank6]:           ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: polars.exceptions.OutOfBoundsError: index 791 is out of bounds for sequence of length 791
Code sample
from pathlib import Path

from litdata import StreamingDataLoader
from litdata.streaming.item_loader import ParquetLoader

# TripleStreamingDataset is a user-defined StreamingDataset subclass (not shown).

def _load_dataset(self):
    # Indexed folder
    data_path = Path(self.data_path)
    # Low memory
    item_loader = ParquetLoader(low_memory=True)
    dset = {
        "train": TripleStreamingDataset(
            str(data_path / "split=train"),
            item_loader=item_loader,
        ),
        "val": TripleStreamingDataset(
            str(data_path / "split=val"),
            item_loader=item_loader,
        ),
        "test": TripleStreamingDataset(
            str(data_path / "split=test"),
            item_loader=item_loader,
        ),
    }

    return dset

def setup(self, stage=None):
    # load
    dset = self._load_dataset()

    # train, val, test split
    self.train_dataset = dset["train"]
    self.val_dataset = dset["val"]
    self.test_dataset = dset["test"]

def train_dataloader(self):
    return StreamingDataLoader(
        self.train_dataset,
        batch_size=self.batch_size,
        **self.common_opts,
    )

def val_dataloader(self):
    return StreamingDataLoader(
        self.val_dataset,
        batch_size=self.batch_size * 2,
        **self.common_opts,
    )

def test_dataloader(self):
    return StreamingDataLoader(
        self.test_dataset,
        batch_size=self.batch_size * 2,
        **self.common_opts,
    )

Expected behavior

Works without any problem.

Additional context

Environment detail
  • PyTorch Version: 2.6.0
  • OS: Ubuntu 20.04
  • How you installed PyTorch: uv pip
  • Build command you used: ...
  • Python version: 3.11
  • CUDA/cuDNN version: 12.4
  • GPU models and configuration: A100 * 8
  • Any other relevant information: Using Lightning + DDP for training
@kyoungrok0517 kyoungrok0517 added bug Something isn't working help wanted Extra attention is needed labels Apr 13, 2025

Hi! Thanks for your contribution, great first issue!

@bhimrazy
Collaborator
bhimrazy commented Apr 13, 2025

Hi @kyoungrok0517, thanks a lot for opening the issue!

To help us look into it, could you also share a bit more detail about the dataset? If it’s publicly available, a link to the Parquet file would be super helpful. Otherwise, any metadata or a sample schema from the file that's triggering the error could help us reproduce and debug it on our end.

Appreciate your help—thanks!

@kyoungrok0517
Author
kyoungrok0517 commented Apr 13, 2025

@bhimrazy Hello. Thanks for the support! Here are the schema and a sample of my data. The rows are pairs of a query and a site, where each site is composed of many document embeddings. I've obfuscated the query and site for privacy. I'm using DistributedSampler and num_workers=8 in Lightning to read this data in DDP mode (n_gpu=8).

Schema([('query', String),
        ('site', String),
        ('rank', Int32),
        ('query_embedding', List(Float64)),
        ('site_embeddings', List(List(Float64))),
        ('attention_mask', List(Int32))])
{'query': shape: (1,)
 Series: 'query' [str]
[
	"NbrnTP3fAbnF"
],
'site': shape: (1,)
Series: 'site' [str]
[
	"http://jeunovhpmf.com/i2t1py"
],
'rank': shape: (1,)
Series: 'rank' [i32]
[
	12
],
'query_embedding': shape: (1,)
Series: 'query_embedding' [list[f64]]
[
	[0.146101, 0.047645, … 0.054902]
],
'site_embeddings': shape: (1,)
Series: 'site_embeddings' [list[list[f64]]]
[
	[[-0.0821, 0.0113, … 0.0124], [0.137, 0.0046, … 0.0979], … [-0.2549, -0.0328, … -0.3506]]
],
'attention_mask': shape: (1,)
Series: 'attention_mask' [list[i32]]
[
	[1, 1, … 1]
]}

@bhimrazy
Collaborator

Thank you for sharing the details, @kyoungrok0517!
I'll give it a try with the sample dataset and see if I can reproduce the issue on my end.

@tchaton
Collaborator
tchaton commented Apr 15, 2025

@bhimrazy I wonder if the issue isn't coming from the compression, where we start reading from the file before it is fully there.


@bhimrazy
Collaborator
bhimrazy commented Apr 16, 2025

@bhimrazy I wonder if the issue isn't coming from the compression, where we start reading from the file before it is fully there.

@tchaton Thanks for the suggestion! I don't think it's related to compression; the file does fully exist before reading, thanks to the atomic download mechanism.
Also, since low_memory=False works, that likely rules out compression issues.

I suspect it might be related to the number of rows within row groups differing across a chunk. Haven’t reproduced it with the sample yet, but I’ll run a few more tests with DDP to be sure.

@bhimrazy
Collaborator

Update on the DDP Test:
I've run tests under DDP using the provided sample data, as well as other datasets like OpenThoughts-114k and fineweb-edu/sample/10BT. So far, I haven’t been able to reproduce the same bug.

The actual reason behind it is still unclear.

By the way, @kyoungrok0517 — would it be possible for you to create and share a Lightning Studio that reproduces the issue using your sample or any dataset where it appears?
Thank you!

cc: @tchaton

@tchaton tchaton added the waiting on author Waiting for user input or feedback. label Apr 23, 2025
@kyoungrok0517
Author

@bhimrazy Thanks for trying :) I'll create and share a reproducible project on Lightning Studio soon.

@bhimrazy
Collaborator

@bhimrazy Thanks for trying :) I'll create and share a reproducible project on Lightning Studio soon.

Thank you, @kyoungrok0517 — really appreciate it!

@bhimrazy
Collaborator
bhimrazy commented Jun 3, 2025

Hi @kyoungrok0517 👋, just checking in to see if you’ve had a chance to create the Lightning Studio project that reproduces the issue. No rush at all, just following up in case anything new has come up.

If you need help setting it up, this guide might be useful: https://www.youtube.com/watch?v=YcW-2Zt_bFg

Thanks again 🙏

No branches or pull requests

3 participants