Optimizations using batch mapping #485
Hey @RobotSail, I have tried to use batch mapping for the |
Looks like you just have a linter error. Try running `make fix` or `tox -e ruff` and it should fix your files. Otherwise this looks good 👍
Thank you so much for the contribution!!
@aryanorpe It looks like you still have some linter errors
Try running either |
Sure @RobotSail, I will fix the linter error. Should I continue and apply the same batch mapping optimization to the other functions which are using |
@booxter @JamesKunstle @cdoern Could one of you please take a look at this PR when you get a chance? |
@aryanorpe It looks like you may need to fix your DCO |
…uses data.map in process_messages_into_input_ids function in data_process.py Signed-off-by: aryanorpe <aryorpe@gmail.com>
Signed-off-by: aryanorpe <aryorpe@gmail.com>
…ssages_into_input_ids`. Signed-off-by: aryanorpe <aryorpe@gmail.com>
@RobotSail Done, I have fixed the DCO |
Hey @JamesKunstle @cdoern @booxter I know you guys are all probably super busy, but could you please take a look at this when you have a chance? |
Very nice contribution, thank you!
Three things:
- I left a small inline note about a function name (sample => batch).
- Should the batch size be configurable somehow? Not sure if that's relevant.
- Could you write a test that'd go in `tests/unit`? You could use Cursor to recommend a few. This isn't a hard requirement to merge (we haven't written those kinds of tests yet!) but it'd be a nice-to-have.
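To make the last two points concrete, here is a rough sketch of what such a unit test might check: that a batched processing function gives the same result whatever batch size is used. Everything in it is hypothetical and not taken from this PR: `process_batch`, `map_in_batches`, and the `lengths` column are stand-ins, and the pure-Python driver only imitates how a batched map slices columns, so the test needs no external dependencies.

```python
from typing import Callable, Dict, List

Batch = Dict[str, list]


def process_batch(batch: Batch) -> Batch:
    """Hypothetical batched function: receives lists of values per column."""
    return {"lengths": [len(text.split()) for text in batch["text"]]}


def map_in_batches(rows: Batch, fn: Callable[[Batch], Batch], batch_size: int = 1000) -> Batch:
    """Hypothetical driver that slices columns into chunks of `batch_size`."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    num_rows = len(next(iter(rows.values()), []))
    out: Batch = {}
    for start in range(0, num_rows, batch_size):
        chunk = {name: col[start:start + batch_size] for name, col in rows.items()}
        # Append each chunk's output columns to the accumulated result.
        for name, values in fn(chunk).items():
            out.setdefault(name, []).extend(values)
    return out


def test_result_is_independent_of_batch_size() -> None:
    rows = {"text": ["a b", "c", "d e f", "g h", "i"]}
    expected = {"lengths": [2, 1, 3, 2, 1]}
    # Cover batch sizes below, equal to, and above the number of rows.
    for batch_size in (1, 2, 5, 1000):
        assert map_in_batches(rows, process_batch, batch_size=batch_size) == expected


test_result_is_independent_of_batch_size()
```

Exposing `batch_size` as a keyword argument with a default keeps existing callers unchanged while letting a test exercise small batches.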
Hey @JamesKunstle, Thank you for the feedback! Sure will definitely incorporate your action items 👍 |
Co-authored-by: James Kunstle <52969093+JamesKunstle@users.noreply.github.com> Signed-off-by: Aryan Orpe <53704316+aryanorpe@users.noreply.github.com>
Signed-off-by: aryanorpe <aryorpe@gmail.com>
Signed-off-by: aryanorpe <aryorpe@gmail.com>
Signed-off-by: aryanorpe <aryorpe@gmail.com>
…nit with linting. Signed-off-by: aryanorpe <aryorpe@gmail.com>
…put_ids_and_labels function. Signed-off-by: aryanorpe <aryorpe@gmail.com>
Hey @RobotSail @JamesKunstle @booxter @joesepi Sorry I know you must be super busy but would really appreciate if you can take a quick look and review my PR. Thank you, Best regards |
Thank you so much @RobotSail for approving my PR! ❤️ Please @JamesKunstle @cdoern @booxter if you can take a few mins of your time to quickly scan my PR, it would mean a lot to me! |
Created PR with changes for issue #449 involving using the datasets library's batched API for efficient data processing. Not completed yet, looking to verify changes made so far to check if I am on the right track. Thanks!
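For readers unfamiliar with the batched API mentioned above, the sketch below contrasts a per-sample function with a batched one. It is illustrative only and is not the code from this PR: `toy_tokenize`, `process_sample`, and `process_batch` are hypothetical names, and the toy tokenizer simply maps each word to its length. The `datasets` call is guarded so the example still runs when the library is not installed.

```python
from typing import Dict, List


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in for a real tokenizer: one id per whitespace token."""
    return [len(token) for token in text.split()]


def process_sample(sample: Dict[str, str]) -> Dict[str, List[int]]:
    # Per-sample form: called once per row with scalar column values.
    return {"input_ids": toy_tokenize(sample["text"])}


def process_batch(batch: Dict[str, List[str]]) -> Dict[str, List[List[int]]]:
    # Batched form: called once per batch; each column is a list of values,
    # and each returned column must also be a list.
    return {"input_ids": [toy_tokenize(text) for text in batch["text"]]}


rows = {"text": ["hello world", "batch mapping is fast"]}
per_sample = [process_sample({"text": t})["input_ids"] for t in rows["text"]]
per_batch = process_batch(rows)["input_ids"]

try:
    from datasets import Dataset
except ImportError:  # the library is optional for this sketch
    Dataset = None

if Dataset is not None:
    ds = Dataset.from_dict(rows)
    # batched=True passes column lists to the function; batch_size sets
    # how many rows go into each call.
    mapped = ds.map(process_batch, batched=True, batch_size=1000)
    hf_result = [row["input_ids"] for row in mapped]
```

Both forms should produce identical rows; the batched form just makes fewer Python-level function calls.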