feat: MLFlow Integration for experiment tracking by therealnaveenkamal · Pull Request #534 · NVIDIA-NeMo/RL · GitHub

feat: MLFlow Integration for experiment tracking #534

Open: therealnaveenkamal wants to merge 5 commits into main

Conversation

@therealnaveenkamal commented Jun 21, 2025

What does this PR do?

Add MLflow integration for experiment tracking and model management in NeMo RL.

Issues

Closes #514 - Support for MLflow
This PR addresses the community request for MLflow support alongside existing TensorBoard and Weights & Biases logging options.

Usage

# Install MLflow
uv pip install mlflow

# Run SFT training with MLflow logging
uv run python examples/run_sft.py --config examples/configs/sft_mlflow.yaml

# Start MLflow UI to view results
mlflow ui --host 0.0.0.0 --port 5000

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

This PR adds comprehensive MLflow integration including:

  • MLflowLogger class with support for metrics, hyperparameters, and plot logging
  • Graceful fallbacks for missing configuration keys
  • Example configuration (sft_mlflow.yaml) for SFT training with MLflow
  • Updated README with installation and usage instructions
  • Fixed test coverage for MLflow plot logging functionality

The integration follows the existing logger architecture and provides a seamless experience for users who want to use MLflow for experiment tracking alongside or instead of TensorBoard and Weights & Biases.
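
As a rough illustration of the logger described above, here is a minimal sketch; it is not the code from this PR. The class name `SketchMLflowLogger`, the `flatten` helper, the injected `backend` argument, and the default experiment name are assumptions made for illustration. The only MLflow calls it relies on are `set_tracking_uri`, `set_experiment`, `start_run`, `log_metric`, and `log_params`.

```python
from typing import Any, Dict, Mapping


def flatten(params: Mapping[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested hyperparameters into dotted keys (MLflow params are flat)."""
    flat: Dict[str, Any] = {}
    for key, value in params.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, Mapping):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


class SketchMLflowLogger:
    """Hypothetical logger; `backend` is the `mlflow` module or a stand-in."""

    def __init__(self, cfg: Mapping[str, Any], backend: Any) -> None:
        self.backend = backend
        # Fall back to defaults when a key is missing from the config.
        self.experiment_name = cfg.get("experiment_name", "default-experiment")
        self.run_name = cfg.get("run_name")  # None lets MLflow choose a name
        tracking_uri = cfg.get("tracking_uri")
        if tracking_uri:
            backend.set_tracking_uri(tracking_uri)
        backend.set_experiment(self.experiment_name)
        backend.start_run(run_name=self.run_name)

    def log_metrics(self, metrics: Mapping[str, float], step: int, prefix: str = "") -> None:
        # One log_metric call per entry, with an optional "train/"-style prefix.
        for name, value in metrics.items():
            key = f"{prefix}/{name}" if prefix else name
            self.backend.log_metric(key, float(value), step=step)

    def log_hyperparams(self, params: Mapping[str, Any]) -> None:
        self.backend.log_params(flatten(params))
```

Passing the `mlflow` module itself as `backend` would send data to a real tracking server; injecting it is a design choice that also lets the logger be exercised with a stub in unit tests.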

Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
@therealnaveenkamal changed the title from "Feat: MLFlow Integration for experiment tracking" to "MLFlow Integration for experiment tracking" on Jun 21, 2025
@therealnaveenkamal changed the title from "MLFlow Integration for experiment tracking" to "feat: MLFlow Integration for experiment tracking" on Jun 21, 2025
@SahilJain314 (Contributor) left a comment

Thanks for contributing! @terrykong to comment on the feature broadly

README.md Outdated
@@ -316,6 +321,58 @@ sbatch \
ray.sub
```

## MLflow Integration
Contributor

Maybe move this to a separate documentation page (instead of top level README?) and link.

Author

updated readme

@@ -0,0 +1,142 @@
# SFT Algorithm Configuration with MLflow logging
sft:
Contributor

Can the mlflow args be put (default false) into the main config? The docs can describe the features and how to turn them on.

Contributor

+1 to defaulting to false

you'll have to add this to all the configs here to make sure nothing breaks from this change:

Author

added mlflow_enabled: false to the loggers of the configs.
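
To make the default-off behaviour discussed in this thread concrete, here is a small sketch of how a logger section might be read so that configs without the new key keep working. Only `mlflow_enabled` comes from this conversation; the function name `enabled_backends` and the other flag names (`wandb_enabled`, `tensorboard_enabled`) are assumptions for illustration.

```python
from typing import Any, List, Mapping

# Only `mlflow_enabled` is mentioned in this thread; the other flags are assumed.
BACKEND_FLAGS = {
    "wandb": "wandb_enabled",
    "tensorboard": "tensorboard_enabled",
    "mlflow": "mlflow_enabled",
}


def enabled_backends(logger_cfg: Mapping[str, Any]) -> List[str]:
    """Return the backends whose flag is truthy; a missing flag counts as False."""
    return [
        name
        for name, flag in BACKEND_FLAGS.items()
        if bool(logger_cfg.get(flag, False))
    ]
```

With this reading, an older config that never mentions `mlflow_enabled` behaves the same as one that sets it to `false`.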

@terrykong (Contributor) left a comment

thanks @therealnaveenkamal for the contribution! appreciate your help making nemo-rl better!

left some comments below

Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
@github-actions bot added the documentation label (Improvements or additions to documentation) on Jun 27, 2025
@terrykong (Contributor) left a comment

a couple other comments

Args:
cfg: MLflow configuration
log_dir: Optional log directory (used as artifact_location for MLflow)
Contributor

is this a typo to not plumb this as the artifact_location?
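
For context on the question above, one way `log_dir` could be plumbed through as the artifact location is sketched below. This is an assumption about the intended behaviour, not the PR's code: the helper names `resolve_artifact_location` and `ensure_experiment` are hypothetical, and `client` stands for the `mlflow` module or any object exposing `get_experiment_by_name` and `create_experiment`.

```python
from typing import Any, Mapping, Optional


def resolve_artifact_location(
    cfg: Mapping[str, Any], log_dir: Optional[str] = None
) -> Optional[str]:
    """Prefer an explicit `artifact_location` in the config, else fall back to log_dir."""
    return cfg.get("artifact_location") or log_dir


def ensure_experiment(
    client: Any, name: str, artifact_location: Optional[str] = None
) -> str:
    """Return the experiment id, creating the experiment if it does not exist.

    The artifact location is only applied when the experiment is created;
    an existing experiment keeps the location it already has.
    """
    existing = client.get_experiment_by_name(name)
    if existing is not None:
        return existing.experiment_id
    return client.create_experiment(name, artifact_location=artifact_location)
```

A caller would then do something like `ensure_experiment(mlflow, cfg["experiment_name"], resolve_artifact_location(cfg, log_dir))` before starting a run.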

"experiment_name": "test-experiment",
"run_name": "test-run",
"tracking_uri": "http://localhost:5000",
"artifact_location": "/tmp/artifacts",
Contributor

reminder to clean this line up after refactor

@therealnaveenkamal (Author) commented

Sure @terrykong, working on your comments. I'll ping you here once I'm ready. Thanks for your support.

Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
@therealnaveenkamal (Author) commented

@terrykong I think I've addressed all your comments. please review and let me know if further edits are required.
