fix restart policy bug in mpi job UpdateJobConditions #2344
Conversation
I'm not sure this is really a bug.
Could you add tests that demonstrate the failure scenario?
https://github.com/kubeflow/training-operator/blob/master/pkg/controller.v1/mpi/mpijob_controller_test.go
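A minimal table-driven sketch of the kind of test being asked for, assuming the check is extracted into a hypothetical shouldRestart helper (in the actual controller the check is inline in UpdateJobStatus):

package mpi

import (
	"testing"

	kubeflowv1 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1"
)

// shouldRestart mirrors the condition under discussion. It is a hypothetical
// helper, extracted here only so the logic can be unit-tested in isolation.
func shouldRestart(policy kubeflowv1.RestartPolicy) bool {
	return policy != kubeflowv1.RestartPolicyNever
}

func TestShouldRestartOnFailedReplica(t *testing.T) {
	cases := []struct {
		name   string
		policy kubeflowv1.RestartPolicy
		want   bool
	}{
		{"Always restarts", kubeflowv1.RestartPolicyAlways, true},
		{"OnFailure restarts", kubeflowv1.RestartPolicyOnFailure, true},
		{"ExitCode restarts", kubeflowv1.RestartPolicyExitCode, true},
		{"Never fails the job", kubeflowv1.RestartPolicyNever, false},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			if got := shouldRestart(tc.policy); got != tc.want {
				t.Errorf("shouldRestart(%v) = %v, want %v", tc.policy, got, tc.want)
			}
		})
	}
}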
@@ -597,7 +597,7 @@ func (jc *MPIJobReconciler) UpdateJobStatus(job interface{}, replicas map[kubefl
		}
	}
	if failed > 0 {
-		if spec.RestartPolicy == kubeflowv1.RestartPolicyExitCode {
+		if spec.RestartPolicy != kubeflowv1.RestartPolicyNever {
@fyxemmmm Sorry for the late reply. Could you please submit this fix to the release-1.9 branch?
@tenzen-y I think it is a bug; here is similar code in the PyTorchJob controller: https://github.com/kubeflow/trainer/blob/release-1.9/pkg/controller.v1/pytorch/pytorchjob_controller.go#L429
I do not think MPI v1 behaves the same as PyTorch, since MPI does not handle the Job RestartPolicy directly:
func setRestartPolicy(podTemplateSpec *corev1.PodTemplateSpec, spec *kubeflowv1.ReplicaSpec) {
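	// Paraphrased sketch of the body: ExitCode is not a valid Pod-level
	// restartPolicy, so it is mapped to Never on the Pod template, and the
	// controller itself decides whether to recreate pods from their exit codes.
	if spec.RestartPolicy == kubeflowv1.RestartPolicyExitCode {
		podTemplateSpec.Spec.RestartPolicy = corev1.RestartPolicyNever
	} else {
		podTemplateSpec.Spec.RestartPolicy = corev1.RestartPolicy(spec.RestartPolicy)
	}
}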
Oh, I see. Shouldn't we set it as follows then?
if spec.RestartPolicy != kubeflowv1.RestartPolicyNever && spec.RestartPolicy != kubeflowv1.RestartPolicyExitCode
I'm not sure that conditional check works well, because MPI v1 doesn't use the generic PodReconciler; it has a dedicated Pod reconciler:
func (jc *MPIJobReconciler) ReconcilePods(
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
What this PR does / why we need it:
This PR fixes an issue in the MPIJob controller logic related to the handling of failed replicas. Specifically, the condition for determining the restart policy was incorrect.
Previously, the code used:
if spec.RestartPolicy == kubeflowv1.RestartPolicyExitCode
This condition was too narrow: replicas configured with RestartPolicyAlways or RestartPolicyOnFailure also expect the job to restart, but under the old check any failure with those policies marked the whole job as failed. The incorrect logic could lead to unexpected behavior, such as jobs failing instead of restarting.
The updated code uses:
if spec.RestartPolicy != kubeflowv1.RestartPolicyNever
This ensures that jobs are restarted appropriately unless explicitly configured not to restart, aligning the behavior with the intended design and user expectations.
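In other words, the failure-handling branch now behaves like this (a simplified sketch; updateCondition stands in for the actual condition bookkeeping inside UpdateJobStatus):

if failed > 0 {
	if spec.RestartPolicy != kubeflowv1.RestartPolicyNever {
		// Always, OnFailure, and ExitCode all permit a restart, so the job
		// transitions to Restarting instead of failing outright.
		updateCondition(jobStatus, kubeflowv1.JobRestarting) // illustrative helper
	} else {
		// Only an explicit RestartPolicyNever turns a replica failure into a
		// terminal job failure.
		updateCondition(jobStatus, kubeflowv1.JobFailed) // illustrative helper
	}
}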