`OutOfBoundsError` when streaming parquet files with `low_memory=True` · Issue #553 · Lightning-AI/litData · GitHub

OutOfBoundsError when streaming parquet files with low_memory=True #553


Open
kyoungrok0517 opened this issue Apr 13, 2025 · 10 comments
Labels
bug Something isn't working help wanted Extra attention is needed waiting on author Waiting for user input or feedback.

Comments

@kyoungrok0517
kyoungrok0517 commented Apr 13, 2025

🐛 Bug

To Reproduce

I'm trying to stream my parquet dataset with the low_memory=True option, but I encounter an OutOfBoundsError. There is no problem when I set low_memory=False. The parquet files are compressed with zstd.

Error
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/dataset.py", line 382, in __getitem__
[rank6]:     return self.cache[index]
[rank6]:            ~~~~~~~~~~^^^^^^^
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/cache.py", line 145, in __getitem__
[rank6]:     return self._reader.read(index)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/reader.py", line 388, in read
[rank6]:     item = self._item_loader.load_item_from_chunk(
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 619, in load_item_from_chunk
[rank6]:     return self._get_item_with_low_memory(chunk_index, chunk_filepath, relative_index)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 678, in _get_item_with_low_memory
[rank6]:     return row_group_df.row(row_index_within_group, named=True)  # type: ignore
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py", line 10853, in row
[rank6]:     row = self._df.row_tuple(index)
[rank6]:           ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: polars.exceptions.OutOfBoundsError: index 791 is out of bounds for sequence of length 791
Code sample
from pathlib import Path

from litdata import StreamingDataLoader
from litdata.streaming.item_loader import ParquetLoader

# TripleStreamingDataset is a user-defined StreamingDataset subclass (not shown).

def _load_dataset(self):
    # Indexed folder
    data_path = Path(self.data_path)
    # Low memory
    item_loader = ParquetLoader(low_memory=True)
    dset = {
        "train": TripleStreamingDataset(
            str(data_path / "split=train"),
            item_loader=item_loader,
        ),
        "val": TripleStreamingDataset(
            str(data_path / "split=val"),
            item_loader=item_loader,
        ),
        "test": TripleStreamingDataset(
            str(data_path / "split=test"),
            item_loader=item_loader,
        ),
    }

    return dset

def setup(self, stage=None):
    # load
    dset = self._load_dataset()

    # train, val, test split
    self.train_dataset = dset["train"]
    self.val_dataset = dset["val"]
    self.test_dataset = dset["test"]

def train_dataloader(self):
    return StreamingDataLoader(
        self.train_dataset,
        batch_size=self.batch_size,
        **self.common_opts,
    )

def val_dataloader(self):
    return StreamingDataLoader(
        self.val_dataset,
        batch_size=self.batch_size * 2,
        **self.common_opts,
    )

def test_dataloader(self):
    return StreamingDataLoader(
        self.test_dataset,
        batch_size=self.batch_size * 2,
        **self.common_opts,
    )

Expected behavior

Works without any problem.

Additional context

Environment detail
  • PyTorch Version: 2.6.0
  • OS: Ubuntu 20.04
  • How you installed PyTorch: uv pip
  • Build command you used: ...
  • Python version: 3.11
  • CUDA/cuDNN version: 12.4
  • GPU models and configuration: A100 * 8
  • Any other relevant information: Using Lightning + DDP for training
@kyoungrok0517 kyoungrok0517 added bug Something isn't working help wanted Extra attention is needed labels Apr 13, 2025

Hi! Thanks for your contribution, great first issue!

@bhimrazy
Collaborator
bhimrazy commented Apr 13, 2025

Hi @kyoungrok0517, thanks a lot for opening the issue!

To help us look into it, could you also share a bit more detail about the dataset? If it’s publicly available, a link to the Parquet file would be super helpful. Otherwise, any metadata or a sample schema from the file that's triggering the error could help us reproduce and debug it on our end.

Appreciate your help—thanks!

@kyoungrok0517
Author
kyoungrok0517 commented Apr 13, 2025

@bhimrazy Hello. Thanks for the support! Here are the schema and a sample of my data. The rows are pairs of a query and a site, where each site is composed of many document embeddings. I've obfuscated the query and site for privacy. I'm using DistributedSampler and num_workers=8 in Lightning to read this data in DDP mode (n_gpu=8).

Schema([('query', String),
        ('site', String),
        ('rank', Int32),
        ('query_embedding', List(Float64)),
        ('site_embeddings', List(List(Float64))),
        ('attention_mask', List(Int32))])
{'query': shape: (1,)
 Series: 'query' [str]
[
	"NbrnTP3fAbnF"
],
'site': shape: (1,)
Series: 'site' [str]
[
	"http://jeunovhpmf.com/i2t1py"
],
'rank': shape: (1,)
Series: 'rank' [i32]
[
	12
],
'query_embedding': shape: (1,)
Series: 'query_embedding' [list[f64]]
[
	[0.146101, 0.047645, … 0.054902]
],
'site_embeddings': shape: (1,)
Series: 'site_embeddings' [list[list[f64]]]
[
	[[-0.0821, 0.0113, … 0.0124], [0.137, 0.0046, … 0.0979], … [-0.2549, -0.0328, … -0.3506]]
],
'attention_mask': shape: (1,)
Series: 'attention_mask' [list[i32]]
[
	[1, 1, … 1]
]}

@bhimrazy
Collaborator

Thank you for sharing the details, @kyoungrok0517!
I'll give it a try with the sample dataset and see if I can reproduce the issue on my end.

@tchaton
Collaborator
tchaton commented Apr 15, 2025

@bhimrazy I wonder if the issue isn't coming from the compression, where we start reading from the file before it is fully there.


@bhimrazy
Collaborator
bhimrazy commented Apr 16, 2025

@bhimrazy I wonder if the issue isn't coming from the compression, where we start reading from the file before it is fully there.

@tchaton Thanks for the suggestion! I don't think it's related to compression; the file does fully exist before reading, thanks to the atomic download mechanism.
Also, since low_memory=False works, that likely rules out compression issues.

I suspect it might be related to the number of rows within row groups differing across a chunk. Haven’t reproduced it with the sample yet, but I’ll run a few more tests with DDP to be sure.

@bhimrazy
Collaborator

Update on the DDP Test:
I've run tests under DDP using the provided sample data, as well as other datasets like OpenThoughts-114k and fineweb-edu/sample/10BT. So far, I haven’t been able to reproduce the same bug.

The actual reason behind it is still unclear.

By the way, @kyoungrok0517 — would it be possible for you to create and share a Lightning Studio that reproduces the issue using your sample or any dataset where it appears?
Thank you!

cc: @tchaton

@tchaton tchaton added the waiting on author Waiting for user input or feedback. label Apr 23, 2025
@kyoungrok0517
Author

@bhimrazy Thanks for trying :) I'll create and share a reproducible project on Lightning Studio soon.

@bhimrazy
Collaborator

@bhimrazy Thanks for trying :) I'll create and share a reproducible project on Lightning Studio soon.

Thank you, @kyoungrok0517 — really appreciate it!

@bhimrazy
Collaborator
bhimrazy commented Jun 3, 2025

Hi @kyoungrok0517 👋, just checking in to see if you’ve had a chance to create the Lightning Studio project that reproduces the issue. No rush at all, just following up in case anything new has come up.

If you need help setting it up, this guide might be useful: https://www.youtube.com/watch?v=YcW-2Zt_bFg

Thanks again 🙏

No branches or pull requests

3 participants