Optimizations using batch mapping by aryanorpe · Pull Request #485 · instructlab/training · GitHub

Optimizations using batch mapping #485

Open
wants to merge 9 commits into
base: main
from

Conversation

aryanorpe

I created this PR with changes for issue #449, which involves using the datasets library's batched API for efficient data processing. It is not complete yet; I am looking to verify the changes made so far and check that I am on the right track. Thanks!

@mergify mergify bot added the ci-failure label Apr 21, 2025
@aryanorpe
Author

Hey @RobotSail,

I have tried to use batch mapping for the load_and_validate_dataset function, which uses data.map in the process_messages_into_input_ids function. I just wanted to double-check that this is the correct way of doing it, and to ask how I can test whether my changes work and actually speed up the data processing. Thank you!
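
For illustration, here is a minimal sketch of the difference between a per-sample function and a batched one. It does not reproduce the code in this PR: the function names and the toy `ord`-based "tokenizer" are hypothetical, and plain Python dicts stand in for the dataset. With the datasets library, a function of the second shape is what would be passed as `data.map(fn, batched=True)`, where it receives a dict mapping each column name to a list of values and returns a dict of lists.

```python
from typing import Dict, List


def tokenize_sample(sample: Dict[str, str]) -> Dict[str, List[int]]:
    """Per-sample shape: receives one row, returns one row."""
    # Toy stand-in for a tokenizer (assumption, not the project's tokenizer).
    return {"input_ids": [ord(ch) for ch in sample["text"]]}


def tokenize_batch(batch: Dict[str, List[str]]) -> Dict[str, List[List[int]]]:
    """Batched shape: receives whole columns as lists, returns columns as lists."""
    return {"input_ids": [[ord(ch) for ch in text] for text in batch["text"]]}


# Row-oriented data, as a per-sample map would see it.
rows = [{"text": "hi"}, {"text": "ok"}]

# The same data in column-oriented form, as a batched map would see it.
batch = {"text": [row["text"] for row in rows]}

per_sample = [tokenize_sample(row)["input_ids"] for row in rows]
batched = tokenize_batch(batch)["input_ids"]
```

Checking that `per_sample` and `batched` are equal is one simple way to confirm that converting a function to the batched form did not change its output.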

RobotSail
RobotSail previously approved these changes Apr 23, 2025
Member
@RobotSail RobotSail left a comment

Looks like you just have a linter error, try running make fix or tox -e ruff and it should fix your files. Otherwise this looks good 👍

Member
@RobotSail RobotSail left a comment

Thank you so much for the contribution!!

@mergify mergify bot added the one-approval label Apr 23, 2025
@RobotSail
Member

@aryanorpe It looks like you still have some linter errors

  ERROR: one or more checks have failed.
  Run 'tox -e ruff' to auto-correct all fixable errors.
  ruff: exit 3 (0.18 seconds) /home/runner/work/training/training> ./scripts/ruff.sh check pid=2362
  ruff: FAIL code 3 (2.26=setup[2.08]+cmd[0.18] seconds)
  evaluation failed :( (2.33 seconds)

Try running either make fix or tox -e ruff and it should resolve your issues.

@aryanorpe
Author

Sure @RobotSail, I will fix the linter error. Should I continue and apply the same batch mapping optimization to the other functions that use .map or .filter in the process_messages_into_input_ids function?
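
As a rough sketch of what the batched form of a filter looks like: with the datasets library, a function passed as `data.filter(fn, batched=True)` receives a dict of columns and returns one boolean per row. The predicate name, the `max_len` threshold, and the sample data below are hypothetical, and plain Python stands in for the dataset.

```python
from typing import Dict, List


def keep_short_batch(batch: Dict[str, List[List[int]]], max_len: int = 3) -> List[bool]:
    """Batched predicate: one keep/drop decision per row in the batch."""
    return [len(ids) <= max_len for ids in batch["input_ids"]]


# Column-oriented batch with three rows (made-up token ids).
batch = {"input_ids": [[1, 2], [1, 2, 3, 4], []]}

mask = keep_short_batch(batch)

# Apply the mask by hand to show which rows a batched filter would keep.
kept = [ids for ids, keep in zip(batch["input_ids"], mask) if keep]
```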

@RobotSail
Member

@booxter @JamesKunstle @cdoern Could one of you please take a look at this PR when you get a chance?

@RobotSail
Member

@aryanorpe It looks like you may need to fix your DCO

aryanorpe added 3 commits May 3, 2025 16:04
…uses data.map in process_messages_into_input_ids function in data_process.py

Signed-off-by: aryanorpe <aryorpe@gmail.com>
Signed-off-by: aryanorpe <aryorpe@gmail.com>
…ssages_into_input_ids`.

Signed-off-by: aryanorpe <aryorpe@gmail.com>
@aryanorpe
Author

@RobotSail Done, I have fixed the DCO.

@RobotSail
Member

Hey @JamesKunstle @cdoern @booxter I know you guys are all probably super busy, but could you please take a look at this when you have a chance?

Contributor
@JamesKunstle JamesKunstle left a comment

Very nice contribution, thank you!
Three things:

  1. I left a small inline note about renaming a function (sample => batch).
  2. Should the batch size be configurable somehow? I'm not sure whether that's relevant.
  3. Could you write a test that would go in tests/unit? You could use Cursor to recommend a few. This isn't a hard requirement to merge (we haven't written those kinds of tests yet!) but it'd be a nice-to-have.
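
Points 2 and 3 can be sketched together: a configurable batch size, plus a unit-style check that the result does not depend on it. Everything here is a hypothetical stand-in rather than the project's code. `map_batched` is a tiny pure-Python imitation of a batched map, and its default of 1000 mirrors what I understand to be the default `batch_size` of `Dataset.map` in the datasets library.

```python
from typing import Callable, Dict, Iterator, List

Columns = Dict[str, list]


def iter_batches(columns: Columns, batch_size: int) -> Iterator[Columns]:
    """Yield consecutive column-oriented slices of at most batch_size rows."""
    num_rows = len(next(iter(columns.values()), []))
    for start in range(0, num_rows, batch_size):
        yield {name: values[start:start + batch_size] for name, values in columns.items()}


def map_batched(columns: Columns, fn: Callable[[Columns], Columns], batch_size: int = 1000) -> Columns:
    """Apply fn to each batch and concatenate the returned columns."""
    out: Columns = {}
    for batch in iter_batches(columns, batch_size):
        for name, values in fn(batch).items():
            out.setdefault(name, []).extend(values)
    return out


def add_length_batch(batch: Columns) -> Columns:
    """Example batched function: compute the length of each row's input_ids."""
    return {"len": [len(ids) for ids in batch["input_ids"]]}


def test_result_independent_of_batch_size() -> None:
    columns = {"input_ids": [[1], [1, 2], [1, 2, 3], [], [5, 6]]}
    expected: List[int] = [1, 2, 3, 0, 2]
    # Cover batch sizes smaller than, equal to, and larger than the row count.
    for batch_size in (1, 2, 5, 1000):
        assert map_batched(columns, add_length_batch, batch_size=batch_size)["len"] == expected
```

A test of this shape could live under tests/unit and run with pytest. It guards against bugs that only show up at batch boundaries, such as a final partial batch being dropped.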

@aryanorpe
Author
aryanorpe commented May 16, 2025

Hey @JamesKunstle,

Thank you for the feedback! Sure, I will definitely incorporate your action items 👍

Co-authored-by: James Kunstle <52969093+JamesKunstle@users.noreply.github.com>
Signed-off-by: Aryan Orpe <53704316+aryanorpe@users.noreply.github.com>
@RobotSail RobotSail dismissed their stale review May 17, 2025 21:33

changes made

@mergify mergify bot removed the one-approval label May 17, 2025
aryanorpe added 3 commits May 18, 2025 09:13
Signed-off-by: aryanorpe <aryorpe@gmail.com>
Signed-off-by: aryanorpe <aryorpe@gmail.com>
Signed-off-by: aryanorpe <aryorpe@gmail.com>
@mergify mergify bot added the testing Relates to testing label May 18, 2025
…nit with linting.

Signed-off-by: aryanorpe <aryorpe@gmail.com>
…put_ids_and_labels function.

Signed-off-by: aryanorpe <aryorpe@gmail.com>
@aryanorpe
Author

Hey @RobotSail @JamesKunstle @booxter @joesepi

Sorry, I know you must be super busy, but I would really appreciate it if you could take a quick look and review my PR.

Thank you,

Best regards
Aryan Orpe

@mergify mergify bot added the one-approval label Jun 17, 2025
@aryanorpe
Author

Thank you so much @RobotSail for approving my PR! ❤️

@JamesKunstle @cdoern @booxter, if you could please take a few minutes to quickly scan my PR, it would mean a lot to me!

Labels
one-approval testing Relates to testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants