-
Notifications
You must be signed in to change notification settings - Fork 56
feat: MLFlow Integration for experiment tracking #534
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: MLFlow Integration for experiment tracking #534
Conversation
Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for contributing! @terrykong to comment on the feature broadly
README.md
Outdated
@@ -316,6 +321,58 @@ sbatch \ | |||
ray.sub | |||
``` | |||
|
|||
## MLflow Integration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe move this to a separate documentation page (instead of top level README?) and link.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add this section to https://github.com/NVIDIA-NeMo/RL/blob/main/docs/design-docs/logger.md?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated readme
examples/configs/sft_mlflow.yaml
Outdated
@@ -0,0 +1,142 @@ | |||
# SFT Algorithm Configuration with MLflow logging | |||
sft: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the mlflow args be put (default false) into the main config? The docs can describe the features and how to turn them on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to defaulting to false
you'll have to add this to all the configs here to make sure nothing breaks from this change:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added mlflow_enabled: false
to the loggers of the configs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @therealnaveenkamal for the contribution! appreciate your help making nemo-rl better!
left some comments below
README.md
Outdated
@@ -316,6 +321,58 @@ sbatch \ | |||
ray.sub | |||
``` | |||
|
|||
## MLflow Integration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add this section to https://github.com/NVIDIA-NeMo/RL/blob/main/docs/design-docs/logger.md?
examples/configs/sft_mlflow.yaml
Outdated
@@ -0,0 +1,142 @@ | |||
# SFT Algorithm Configuration with MLflow logging | |||
sft: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to defaulting to false
you'll have to add this to all the configs here to make sure nothing breaks from this change:
Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a couple other comments
nemo_rl/utils/logger.py
Outdated
Args: | ||
cfg: MLflow configuration | ||
log_dir: Optional log directory (used as artifact_location for MLflow) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a typo to not plumb this as the artifact_location?
tests/unit/utils/test_logger.py
Outdated
"experiment_name": "test-experiment", | ||
"run_name": "test-run", | ||
"tracking_uri": "http://localhost:5000", | ||
"artifact_location": "/tmp/artifacts", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reminder to clean this line up after refactor
Sure @terrykong , working on your comments. I'll ping you here once I'm ready. Thanks for your support. |
Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
@terrykong I think I've addressed all your comments. please review and let me know if further edits are required. |
What does this PR do ?
Add MLflow integration for experiment tracking and model management in NeMo RL
Issues
Closes #514 - Support for MLflow
This PR addresses the community request for MLflow support alongside existing TensorBoard and Weights & Biases logging options.
Usage
Before your PR is "Ready for review"
Pre checks:
Additional Information
This PR adds comprehensive MLflow integration including:
sft_mlflow.yaml
) for SFT training with MLflowThe integration follows the existing logger architecture and provides a seamless experience for users who want to use MLflow for experiment tracking alongside or instead of TensorBoard and Weights & Biases.