OutOfBoundsError when streaming parquet files with low_memory=True #553
Comments
Hi! Thanks for your contribution! Great first issue!
Hi @kyoungrok0517, thanks a lot for opening the issue! To help us look into it, could you also share a bit more detail about the dataset? If it's publicly available, a link to the Parquet file would be super helpful. Otherwise, any metadata or a sample schema from the file that's triggering the error could help us reproduce and debug it on our end. Appreciate your help, thanks!
@bhimrazy Hello. Thanks for the support! Here's the schema and samples of my data. They are pairs of a query and a site, where each site is composed of many document embeddings. I've obfuscated the query and site for privacy. I'm using
Thank you for sharing the details, @kyoungrok0517!
@bhimrazy I wonder if the issue isn't coming from the compression, where we start reading from the file before it is fully there.
@tchaton Thanks for the suggestion! I don't think it's related to compression: the file does fully exist before reading, thanks to the atomic download. I suspect it might be related to the number of rows per row group differing within a chunk. I haven't reproduced it with the sample yet, but I'll run a few more tests with DDP to be sure.
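To make the row-group hypothesis above concrete, here is a minimal sketch in plain Python. It is not litdata's actual implementation; the function names and the row-group sizes are hypothetical and chosen only for illustration. It shows how an index lookup that assumes every row group holds the same number of rows can point past the end of a chunk, while a lookup based on cumulative row counts stays in bounds.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_uniform(index, rows_per_group):
    """Map a chunk-level row index to (row_group, offset),
    assuming every row group has exactly rows_per_group rows."""
    return divmod(index, rows_per_group)


def locate_cumulative(index, group_sizes):
    """Map a chunk-level row index to (row_group, offset)
    using the real, possibly uneven, row-group sizes."""
    ends = list(accumulate(group_sizes))  # running totals, e.g. [3, 8, 10]
    if not 0 <= index < ends[-1]:
        raise IndexError(f"row {index} is outside a chunk of {ends[-1]} rows")
    group = bisect_right(ends, index)  # first group whose end is past index
    start = ends[group - 1] if group else 0
    return group, index - start


def in_bounds(location, group_sizes):
    """Check that (row_group, offset) refers to an existing row."""
    group, offset = location
    return 0 <= group < len(group_sizes) and 0 <= offset < group_sizes[group]


# Hypothetical chunk with uneven row groups (10 rows in total).
group_sizes = [3, 5, 2]
last_row = sum(group_sizes) - 1  # 9

# Assumption: the reader takes the first group's size as the size of all groups.
uniform = locate_uniform(last_row, group_sizes[0])   # (3, 0): group 3 does not exist
exact = locate_cumulative(last_row, group_sizes)     # (2, 1): last row of group 2

print(uniform, in_bounds(uniform, group_sizes))
print(exact, in_bounds(exact, group_sizes))
```

If something like this were the cause, the failure would only show up for files whose row groups are uneven, which might explain why a small sample does not reproduce it.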
Update on the DDP Test:
By the way, @kyoungrok0517, would it be possible for you to create and share a Lightning Studio that reproduces the issue using your sample or any dataset where it appears? cc: @tchaton
@bhimrazy Thanks for trying :) I'll create and share a reproducible project on Lightning Studio soon.
Thank you, @kyoungrok0517, really appreciate it!
Hi @kyoungrok0517 👋 Just checking in to see if you've had a chance to create the Lightning Studio project that reproduces the issue. No rush at all; I just wanted to follow up in case you've made progress or encountered anything new. If you need help setting it up, this guide might be useful: https://www.youtube.com/watch?v=YcW-2Zt_bFg Thanks again 🙏
🐛 Bug
To Reproduce
I'm trying to stream my parquet dataset with the low_memory=True option, but I encounter an OutOfBoundsError. I have no problem when I set low_memory=False. The parquet files are compressed with zstd.
Error
Code sample
Expected behavior
It should work without any problems.
Additional context
Environment detail
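PyTorch version: 2.6.0
How you installed PyTorch: uv pip
Build command: ...
Python version: 3.11
CUDA/cuDNN version: 12.4
GPU models and configuration: A100 * 8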
Lightning + DDP for training