Insights: kubeflow/trainer
Overview
13 Pull requests merged by 6 people
-
Remove SDK
#2657 merged
Jun 9, 2025 -
Tag Docker images with GitHub release tags
#2662 merged
Jun 7, 2025 -
chore(docs): Cherry-pick changelog for Training Operator v1.9.0
#2661 merged
Jun 6, 2025 -
KEP-2401: Support loading local LLMs
#2644 merged
Jun 6, 2025 -
feat(controller): Implement PodSpecOverride API
#2614 merged
Jun 6, 2025 -
Nominate @Electronic-Waste as approver and @astefanutti as reviewer
#2659 merged
Jun 6, 2025 -
chore(build): Support Podman to run OpenAPI generator
#2656 merged
Jun 5, 2025 -
chore(docs): Add OpenSSF Best Practices Badge
#2611 merged
Jun 4, 2025 -
KEP-2401: Support mutating dataset preprocessing config in SDK
#2638 merged
Jun 4, 2025 -
Revert "fix(sdk): Fix type annotation for `train` method's `trainer` parameter"
#2651 merged
May 27, 2025 -
fix(sdk): Fix bad arg passed to `get_args_using_torchtune_config`
#2647 merged
May 27, 2025 -
fix(sdk): Fix type annotation for `train` method's `trainer` parameter
#2646 merged
May 26, 2025 -
[chore] update stale action version to latest
#2642 merged
May 10, 2025
7 Pull requests opened by 5 people
-
(draft)[proposal] GSoC Project 8: JAX Runtime for V2
#2643 opened
May 10, 2025 -
feat(scheduler):add support for kai scheduler
#2649 opened
May 19, 2025 -
Apply resources appropriately to both launcher and node containers
#2653 opened
May 30, 2025 -
KEP-2442: JAX Runtime
#2654 opened
May 31, 2025 -
docs: Add `LocalTrainerClient` example notebook
#2658 opened
Jun 5, 2025 -
KEP-2628: Support KAI Scheduler in Kubeflow Trainer
#2663 opened
Jun 8, 2025 -
Add Changelog for Trainer v2.0.0-rc.0
#2666 opened
Jun 10, 2025
15 Issues closed by 5 people
-
Is it possible to pass annotation and label to jobset?
#2660 closed
Jun 10, 2025 -
Consider container image rename of `kubeflow/storage-initializer`
#2183 closed
Jun 10, 2025 -
KEP-2401: Support loading local LLMs
#2641 closed
Jun 6, 2025 -
KEP-2170: Support the PodSpecOverrides API in TrainJob
#2218 closed
Jun 6, 2025 -
Unit test for trainer_client.py in the v2 SDK
#2652 closed
Jun 6, 2025 -
GPU benchmark image does not exist
#1672 closed
Jun 5, 2025 -
KEP-2401: Support mutating dataset preprocessing config in SDK
#2506 closed
Jun 4, 2025 -
Permission denied when reading TrainJob function script when run as non-root user
#2372 closed
Jun 4, 2025 -
Support richer volcano scheduling
#2182 closed
May 29, 2025 -
Automate Python SDK release process in GitHub Actions
#1540 closed
May 28, 2025 -
Docs: reference architecture for fault tolerance capabilities
#2157 closed
May 25, 2025 -
Training Operator V2 Installation - Certificate error
#2404 closed
May 17, 2025 -
KEP-2170: Design Trainer for the LLM Runtimes
#2321 closed
May 15, 2025 -
[SDK] add option to specify pip flags
#2398 closed
May 15, 2025 -
Reconsider pre-training and post-training phases for the Training Runtimes
#2430 closed
May 13, 2025
5 Issues opened by 5 people
-
Skip Full CI for Non-Code/Docs-Only PRs
#2664 opened
Jun 8, 2025 -
KEP-2655: Kubeflow Data Cache for distributed training on Kubernetes
#2655 opened
Jun 4, 2025 -
Support for ResourcesPerNode in DeepSpeed Training Job Containers
#2650 opened
May 20, 2025 -
Create Trainer UI
#2648 opened
May 15, 2025 -
release the trainer python models
#2645 opened
May 12, 2025
32 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
Add provision to provide local-queue for the training job in SDKv1 an…
#2636 commented on
Jun 4, 2025 • 23 new comments -
KEP-2170: Add manifest overlays for standalone installation
#2527 commented on
Jun 6, 2025 • 6 new comments -
KEP-2170: Add the manifests overlay for Kubeflow Training V2
#2382 commented on
Jun 7, 2025 • 5 new comments -
Fix training client error logs
#2586 commented on
May 29, 2025 • 1 new comment -
fix: istio sidecar injection from annotations to labels
#2637 commented on
May 12, 2025 • 0 new comments -
Add internal-cert-controller disable flag
#2426 commented on
Jun 1, 2025 • 0 new comments -
Imporved the release process of training operator
#2359 commented on
May 28, 2025 • 0 new comments -
fix restart policy bug in mpi job UpdateJobConditions
#2344 commented on
Jun 3, 2025 • 0 new comments -
Enable GPU Testing for LLM Blueprints
#2432 commented on
Jun 9, 2025 • 0 new comments -
Leverage GitHub action arm64 runner
#2422 commented on
Jun 9, 2025 • 0 new comments -
Support KAI Scheduler in Kubeflow Trainer
#2628 commented on
Jun 9, 2025 • 0 new comments -
[Feedback] (the dataset download link gets 403 error) docs/components/training/user-guides/pytorch.md |
#2499 commented on
Jun 9, 2025 • 0 new comments -
Training Operator - panic: runtime error: index out of range
#1842 commented on
Jun 6, 2025 • 0 new comments -
KEP-2401: Kubeflow LLM Trainer V2
#2401 commented on
Jun 6, 2025 • 0 new comments -
Improve Kubeflow Trainer release process
#2155 commented on
Jun 6, 2025 • 0 new comments -
KEP-2170: Kubeflow Trainer V2 API
#2170 commented on
Jun 6, 2025 • 0 new comments -
Create Slurm runtime for model training using V2 APIs
#2249 commented on
Jun 4, 2025 • 0 new comments -
Add unit tests that cover the `pkg/apply` package
#2452 commented on
Jun 4, 2025 • 0 new comments -
KEP-2170: Add Kubeflow Trainer Pipeline Framework Concept page to Documentation
#2458 commented on
Jun 3, 2025 • 0 new comments -
Add a workflow for publishing Helm charts
#2488 commented on
Jun 2, 2025 • 0 new comments -
"zero-trust" security / networking for training jobs
#2341 commented on
May 29, 2025 • 0 new comments -
Export Models to Kubeflow Model Registry
#2438 commented on
May 29, 2025 • 0 new comments -
Support Volcano Scheduler in Kubeflow Trainer
#2437 commented on
May 29, 2025 • 0 new comments -
Support TensorFlow Runtime
#2443 commented on
May 29, 2025 • 0 new comments -
Distributed training with mutliple pods, with multi-gpu in each pod
#2456 commented on
May 29, 2025 • 0 new comments -
KEP-2170: Add AMD ROCm Torch Distributed Training Runtime
#2335 commented on
May 27, 2025 • 0 new comments -
Add the Config API for Kubeflow Trainer controller manager
#2420 commented on
May 27, 2025 • 0 new comments -
v1: Gang Scheduling for Training Operator V1 with KAI Scheduler
#2627 commented on
May 19, 2025 • 0 new comments -
Cannot fine-tune LLM without GPU - CUDA error and DDP initialization
#2371 commented on
May 18, 2025 • 0 new comments -
ci: disable github actions in forked repo
#2601 commented on
May 16, 2025 • 0 new comments -
Support XGBoost/LightGBM runtime and examples
#2598 commented on
May 12, 2025 • 0 new comments -
Support JAX Runtimes
#2442 commented on
May 10, 2025 • 0 new comments