Workflow finishes too early when parallel branches are running #6316


Open
Dupangel opened this issue Mar 13, 2025 · 12 comments

@Dupangel

SUMMARY

While running our workflows, we found that some workflows that are supposed to branch under certain conditions do not execute all the way to the expected last task on the second branch.

Since we created a loop in the workflow, we don't know whether this is a real bug or an improper use of the Orquesta workflow engine.
In the latter case, some documentation may be missing, as there is no warning regarding loops on tasks in the Orquesta engine.

STACKSTORM VERSION

st2 --version: st2 3.5.0, on Python 3.6.8

OS, environment, install method

Running on CentOS Linux release 7.6.1810 (Core) and installed manually (with rpm + dependencies) following installation docs.

Steps to reproduce the problem

Here are two simple workflows to reproduce the problem:

tester_bug.yaml

version: 1.0

description: workflow to reproduce bug with task loop and parallel branching

vars:
  - nextstep: "step1"

tasks:
  entrypoint:
    action: core.noop
    next:
      - do: check_step

  check_step:
    action: core.noop
    next:
      - when: <% ctx(nextstep) = "step1" %>
        publish:
          - nextstep: "step2"
        do:
          - sleep_wf
          - sleep_action
      - when: <% ctx(nextstep) = "step2" %>
        publish:
          - nextstep: "step3"
        do:
          - sleep_action
      - when: <% ctx(nextstep) = "step3" %>
        publish:
          - nextstep: "step4"
        do:
          - sleep_action

  sleep_action:
    action: core.local
    input:
      cmd: "sleep 15"
    next:
      - do: check_step

  sleep_wf:
    action: bull.utilities.sleep_wf
    next:
      - do: check_step


output:
  - message: "toto"

sleep_wf.yaml

version: 1.0

description: workflow that just sleeps

tasks:
  sleep_action:
    action: core.local
    input:
      cmd: "sleep 42"

output:
  - state: "OK"

And their associated metadata files:

mdt_tester_bug.yaml

description: workflow to reproduce bug with task loop and parallel branching
enabled: true
name: tester_bug
notify: {}
pack: toto
runner_type: orquesta
entry_point: "workflows/tester_bug.yaml"

mdt_sleep_wf.yaml

description: workflow that just sleeps
enabled: true
name: sleep_wf
notify: {}
pack: toto
runner_type: orquesta
entry_point: "workflows/sleep_wf.yaml"

Expected Results

StackStorm should return a result only when both branches in tester_bug.yaml are finished.

Actual Results

Instead of returning when both branches are finished, StackStorm terminates the workflow when one of them finishes and kills the other.

Example output of this behaviour:

st2 run toto.tester_bug
.........................
id: 67d2d71c806560c75c1bb55c
action.ref: toto.tester_bug
parameters: None
status: succeeded
start_timestamp: Thu, 13 Mar 2025 14:01:16 CET
end_timestamp: Thu, 13 Mar 2025 14:02:06 CET
log:
  - status: requested
    timestamp: '2025-03-13T13:01:16.346000Z'
  - status: scheduled
    timestamp: '2025-03-13T13:01:16.499000Z'
  - status: running
    timestamp: '2025-03-13T13:01:16.559000Z'
  - status: succeeded
    timestamp: '2025-03-13T13:02:05.915000Z'
result:
  output:
    message: toto
+-----------------------------+-------------------------+--------------+-------------------------+-------------------------------+
| id                          | status                  | task         | action                  | start_timestamp               |
+-----------------------------+-------------------------+--------------+-------------------------+-------------------------------+
|   67d2d71c2db78f675e32168f  | succeeded (1s elapsed)  | entrypoint   | core.noop               | Thu, 13 Mar 2025 14:01:16 CET |
|   67d2d71d2db78f675e32169f  | succeeded (0s elapsed)  | check_step   | core.noop               | Thu, 13 Mar 2025 14:01:17 CET |
|   67d2d71d2db78f675e3216af  | succeeded (16s elapsed) | sleep_action | core.local              | Thu, 13 Mar 2025 14:01:17 CET |
| + 67d2d71d2db78f675e3216b5  | succeeded (44s elapsed) | sleep_wf     | toto.sleep_wf           | Thu, 13 Mar 2025 14:01:17 CET |
|    67d2d71e2db78f675e3216c7 | succeeded (42s elapsed) | sleep_action | core.local              | Thu, 13 Mar 2025 14:01:18 CET |
|   67d2d72d2db78f675e3216d8  | succeeded (0s elapsed)  | check_step   | core.noop               | Thu, 13 Mar 2025 14:01:33 CET |
|   67d2d72d2db78f675e3216e8  | succeeded (16s elapsed) | sleep_action | core.local              | Thu, 13 Mar 2025 14:01:33 CET |
|   67d2d73d2db78f675e3216f8  | succeeded (0s elapsed)  | check_step   | core.noop               | Thu, 13 Mar 2025 14:01:49 CET |
|   67d2d73d2db78f675e321708  | succeeded (16s elapsed) | sleep_action | core.local              | Thu, 13 Mar 2025 14:01:49 CET |
|   67d2d7492db78f675e321722  | succeeded (0s elapsed)  | check_step   | core.noop               | Thu, 13 Mar 2025 14:02:01 CET |
|   67d2d7492db78f675e321732  | running (5s elapsed)    | sleep_action | core.local              | Thu, 13 Mar 2025 14:02:01 CET |
|   67d2d74d2db78f675e321742  | succeeded (0s elapsed)  | check_step   | core.noop               | Thu, 13 Mar 2025 14:02:05 CET |
+-----------------------------+-------------------------+--------------+-------------------------+-------------------------------+

We can see that a sleep_action (not even the last one, but the one before the last) is still running while StackStorm has already returned a successful result.

@Dupangel
Author

Here are the diagrams representing the expected execution result and the workflow logic:

[Image: expected execution result]
[Image: workflow logic]

@fdrab
Contributor
fdrab commented Mar 14, 2025

Not necessarily a bug, but rather a case of expecting it to do something it's not designed to do without explicit instructions. You can try to play around with "join" on the check_step. If I structure my workflow like this:

           / parallel task 1 wait 15 seconds \
task                                           task 3
           \ parallel task 2 wait 10 seconds /

And put a "join" on the task 3, it waits for both parallel branches to finish before executing. If I don't put a "join" on it, task 3 executes when whichever parallel branch finishes the first.

Reference: https://docs.stackstorm.com/orquesta/languages/orquesta.html#task-transition-model
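
For illustration, a minimal Orquesta sketch of that shape (task and action names here are illustrative, not taken from the workflow in this issue):

version: 1.0

description: minimal join example (illustrative only)

tasks:
  task_1:
    action: core.noop
    next:
      - do:
          - parallel_task_1
          - parallel_task_2

  parallel_task_1:
    action: core.local
    input:
      cmd: "sleep 15"
    next:
      - do: task_3

  parallel_task_2:
    action: core.local
    input:
      cmd: "sleep 10"
    next:
      - do: task_3

  # join: all makes task_3 wait until every inbound branch has completed
  task_3:
    join: all
    action: core.noop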

@Dupangel
Author
Dupangel commented Mar 14, 2025

The problem is that these two (or more) branches should each finish in a final state (sometimes the same one, sometimes different ones) without waiting for the other(s). I don't want a join, because from the first check_step execution onward the two branches have to run separately.

From the docs I understand that the Orquesta engine waits for currently running tasks in parallel branches to finish before exiting (in case of failure, the fail-fast design I suppose), but there is no failure here, and I can't find anything about how the workflow ends in case of success with more than one branch running (and without a join).

I may be wrong, but I suspect that instead of "the Orquesta engine exits the workflow once all parallel branches are in the succeeded state or one of them is in the failed state", we actually have "the Orquesta engine exits the workflow once one parallel branch is in the succeeded or failed state".

@nzlosh
Contributor
nzlosh commented Mar 14, 2025

I altered the main workflow to include some output when check_step and sleep_action are called. The workflow is as follows:

version: 1.0

description: workflow to reproduce bug with task loop and parallel branching

vars:
  - nextstep: "step1"
  - action_count: 0

tasks:
  entrypoint:
    action: core.noop
    next:
      - do: check_step

  check_step:
    action: core.local
    input:
      cmd: |
        echo [check_step] nextstep=<% ctx(nextstep) %>
    next:
      - when: <% ctx(nextstep) = "step1" %>
        publish:
          - nextstep: "step2"
        do:
          - sleep_action
          - sleep_wf
      - when: <% ctx(nextstep) = "step2" %>
        publish:
          - nextstep: "step3"
        do:
          - sleep_action
      - when: <% ctx(nextstep) = "step3" %>
        publish:
          - nextstep: "step4"
        do:
          - sleep_action

  sleep_action:
    action: core.local
    input:
      cmd: |
        echo [sleep_action][<% ctx(action_count) %>] nextstep=<% ctx(nextstep) %>
        sleep 2
    next:
      - do: check_step
        publish:
          - action_count: <% ctx(action_count) + 1 %>

  sleep_wf:
    action: playpen.sleep_wf
    next:
      - do: check_step

output:
  - message: "toto"

The delay in sleep_wf is set to 50 seconds. Executing the workflow produces the output below:

+-----------------------------+-------------------------+-----------------+------------------+-------------------------------+
| id                          | status                  | task            | action           | start_timestamp               |
+-----------------------------+-------------------------+-----------------+------------------+-------------------------------+
|   67d4b2d9dd7fe35270479bab  | succeeded (0s elapsed)  | entrypoint      | core.noop        | Fri, 14 Mar 2025 22:51:05 UTC |
|   67d4b2d9dd7fe35270479bbb  | succeeded (0s elapsed)  | check_step      | core.local       | Fri, 14 Mar 2025 22:51:05 UTC |
|   67d4b2d9dd7fe35270479bcb  | succeeded (3s elapsed)  | sleep_action    | core.local       | Fri, 14 Mar 2025 22:51:05 UTC |
| + 67d4b2d9dd7fe35270479bd1  | succeeded (50s elapsed) | sleep_wf        | playpen.sleep_wf | Fri, 14 Mar 2025 22:51:05 UTC |
|    67d4b2dadd7fe35270479be3 | succeeded (48s elapsed) | wf_sleep_action | core.local       | Fri, 14 Mar 2025 22:51:06 UTC |
|   67d4b2dcdd7fe35270479bf4  | succeeded (0s elapsed)  | check_step      | core.local       | Fri, 14 Mar 2025 22:51:08 UTC |
|   67d4b2dcdd7fe35270479c04  | succeeded (2s elapsed)  | sleep_action    | core.local       | Fri, 14 Mar 2025 22:51:08 UTC |
|   67d4b2dedd7fe35270479c14  | succeeded (0s elapsed)  | check_step      | core.local       | Fri, 14 Mar 2025 22:51:10 UTC |
|   67d4b2dedd7fe35270479c24  | succeeded (2s elapsed)  | sleep_action    | core.local       | Fri, 14 Mar 2025 22:51:10 UTC |
|   67d4b2e1dd7fe35270479c34  | succeeded (0s elapsed)  | check_step      | core.local       | Fri, 14 Mar 2025 22:51:13 UTC |
|   67d4b30bdd7fe35270479c54  | succeeded (0s elapsed)  | check_step      | core.local       | Fri, 14 Mar 2025 22:51:55 UTC |
|   67d4b30bdd7fe35270479c64  | succeeded (2s elapsed)  | sleep_action    | core.local       | Fri, 14 Mar 2025 22:51:55 UTC |
|   67d4b30ddd7fe35270479c74  | succeeded (0s elapsed)  | check_step      | core.local       | Fri, 14 Mar 2025 22:51:57 UTC |
|   67d4b30ddd7fe35270479c84  | succeeded (2s elapsed)  | sleep_action    | core.local       | Fri, 14 Mar 2025 22:51:57 UTC |
|   67d4b30fdd7fe35270479c94  | succeeded (1s elapsed)  | check_step      | core.local       | Fri, 14 Mar 2025 22:51:59 UTC |
+-----------------------------+-------------------------+-----------------+------------------+-------------------------------+

We can see the step looping every 2 seconds from 22:51:05 to 22:51:13. A second step loop begins at 22:51:55, which coincides with the exit of the sleep_wf workflow. Iterating over the execution IDs and fetching their stdout produces the following result:

67d4b2d9dd7fe35270479bab
67d4b2d9dd7fe35270479bbb
  stdout: '[check_step] nextstep=step1'
67d4b2d9dd7fe35270479bcb
  stdout: '[sleep_action][0] nextstep=step2'
67d4b2d9dd7fe35270479bd1
67d4b2dadd7fe35270479be3
  stdout: '[sleep_action]'
67d4b2dcdd7fe35270479bf4
  stdout: '[check_step] nextstep=step2'
67d4b2dcdd7fe35270479c04
  stdout: '[sleep_action][1] nextstep=step3'
67d4b2dedd7fe35270479c14
  stdout: '[check_step] nextstep=step3'
67d4b2dedd7fe35270479c24
  stdout: '[sleep_action][2] nextstep=step4'
67d4b2e1dd7fe35270479c34
  stdout: '[check_step] nextstep=step4'
67d4b30bdd7fe35270479c54
  stdout: '[check_step] nextstep=step2'
67d4b30bdd7fe35270479c64
  stdout: '[sleep_action][0] nextstep=step3'
67d4b30ddd7fe35270479c74
  stdout: '[check_step] nextstep=step3'
67d4b30ddd7fe35270479c84
  stdout: '[sleep_action][1] nextstep=step4'
67d4b30fdd7fe35270479c94
  stdout: '[check_step] nextstep=step4'

What can be seen is that action_count is 0 in execution 67d4b2d9dd7fe35270479bcb (22:51:05) and is 0 again in execution 67d4b30bdd7fe35270479c64 (22:51:55).

The workflow only ever increments the action counter, which leads me to think the current execution context is being overwritten with stale data when the parallel task completes. I suspect this is a bug in Orquesta, not StackStorm core, but it needs to be investigated to be confirmed.

These tests were run on:

st2 --version
st2 3.9dev (), on Python 3.8.10

@nzlosh
Contributor
nzlosh commented Mar 15, 2025

Despite my opinion that this is a bug, @fdrab is correct in pointing out that this is expected behaviour, which is documented here: https://docs.stackstorm.com/orquesta/context.html#assignment-scope

For variables with the same name between the context dictionaries, the branch that writes last will overwrite the value in the merged context dictionary.

This basically means you'll need to rethink how you're controlling the flow of the workflow, using something other than the context. It could be an external data source such as redis/consul/etcd, or whatever best fits the use case. Some sort of locking will be needed to ensure consistency when updating the nextstep variable.
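
As a rough illustration only (the datastore key name is hypothetical and no locking is shown), the step marker could live in the st2 datastore, read with the st2kv() function and written via the st2 CLI, instead of in the workflow context:

version: 1.0

description: sketch of driving the branch logic from the datastore (key name is hypothetical)

tasks:
  check_step:
    action: core.noop
    next:
      # read the step marker from the datastore instead of ctx()
      - when: <% st2kv('system.tester_bug.nextstep') = "step1" %>
        do: set_step2

  # persist the next step outside the workflow context
  set_step2:
    action: core.local
    input:
      cmd: "st2 key set tester_bug.nextstep step2"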

@Dupangel
Author

I don't understand why this nextstep variable should need locking, as the branches' context dictionaries are only merged when the branches converge:

In a workflow with parallel branches, the context dictionary is scoped to each branch and merged when the branches converge with join.

As far as I understand, a join naturally happens when all branches are complete, in order to create the global workflow context used for output. That is not the case here: I don't do any explicit join, so there shouldn't be any merge until both branches finish.

Moreover, your previous example doesn't seem to reproduce the behaviour, as all of your sleep_action executions finish without the workflow exiting after the last step of the first branch (the last execution_id of your first branch is 67d4b2e1dd7fe35270479c34, while mine is 67d2d74d2db78f675e321742 and 67d2d7492db78f675e321732 is still running).

@Dupangel
Author
Dupangel commented Mar 17, 2025

I tried with your timers (sleep 2 for sleep_action and sleep 50 for sleep_wf) and I confirm I don't see the behaviour anymore.
However, when I set the timers back to the previous values (sleep 15 for sleep_action and sleep 42 for sleep_wf), I hit this bug on every run.

@nzlosh
Contributor
nzlosh commented Mar 17, 2025

I re-ran the workflow with sleep 15 and sleep_wf 45 and got the expected behaviour (standard output is appended to each line):

+-----------------------------+-------------------------+-----------------+------------------+-------------------------------+
| id                          | status                  | task            | action           | start_timestamp               |
+-----------------------------+-------------------------+-----------------+------------------+-------------------------------+
|   67d7ef06a074d058f7e4c393  | succeeded (0s elapsed)  | entrypoint      | core.noop        | Mon, 17 Mar 2025 09:44:38 UTC |
|   67d7ef06a074d058f7e4c3a3  | succeeded (0s elapsed)  | check_step      | core.local       | Mon, 17 Mar 2025 09:44:38 UTC | '[check_step] nextstep=step1'
|   67d7ef06a074d058f7e4c3b3  | succeeded (16s elapsed) | sleep_action    | core.local       | Mon, 17 Mar 2025 09:44:38 UTC | '[sleep_action][0] nextstep=step2'
| + 67d7ef06a074d058f7e4c3b9  | succeeded (42s elapsed) | sleep_wf        | playpen.sleep_wf | Mon, 17 Mar 2025 09:44:38 UTC |
|    67d7ef07a074d058f7e4c3cb | succeeded (41s elapsed) | wf_sleep_action | core.local       | Mon, 17 Mar 2025 09:44:39 UTC | '[sleep_action]'
|   67d7ef16a074d058f7e4c3dc  | succeeded (0s elapsed)  | check_step      | core.local       | Mon, 17 Mar 2025 09:44:54 UTC | '[check_step] nextstep=step2'
|   67d7ef16a074d058f7e4c3ec  | succeeded (14s elapsed) | sleep_action    | core.local       | Mon, 17 Mar 2025 09:44:54 UTC | '[sleep_action][1] nextstep=step3'
|   67d7ef24a074d058f7e4c3fc  | succeeded (0s elapsed)  | check_step      | core.local       | Mon, 17 Mar 2025 09:45:08 UTC | '[check_step] nextstep=step3'
|   67d7ef24a074d058f7e4c40c  | succeeded (0s elapsed)  | exit_point      | core.noop        | Mon, 17 Mar 2025 09:45:08 UTC |
|   67d7ef24a074d058f7e4c412  | succeeded (16s elapsed) | sleep_action    | core.local       | Mon, 17 Mar 2025 09:45:08 UTC | '[sleep_action][2] nextstep=step4'
|   67d7ef30a074d058f7e4c434  | succeeded (0s elapsed)  | check_step      | core.local       | Mon, 17 Mar 2025 09:45:20 UTC | '[check_step] nextstep=step2'
|   67d7ef30a074d058f7e4c444  | running (4s elapsed)    | sleep_action    | core.local       | Mon, 17 Mar 2025 09:45:20 UTC | '[sleep_action][0] nextstep=step3'
|   67d7ef34a074d058f7e4c454  | succeeded (0s elapsed)  | check_step      | core.local       | Mon, 17 Mar 2025 09:45:24 UTC | '[check_step] nextstep=step4'
|   67d7ef34a074d058f7e4c464  | succeeded (0s elapsed)  | exit_point      | core.noop        | Mon, 17 Mar 2025 09:45:24 UTC |
+-----------------------------+-------------------------+-----------------+------------------+-------------------------------+

The display shows 67d7ef30a074d058f7e4c444 as still running on exit. The action does complete, as shown in a subsequent fetch of the parent execution:

+-----------------------------+-------------------------+-----------------+------------------+-------------------------------+
| id                          | status                  | task            | action           | start_timestamp               |
+-----------------------------+-------------------------+-----------------+------------------+-------------------------------+
|   67d7ef06a074d058f7e4c393  | succeeded (0s elapsed)  | entrypoint      | core.noop        | Mon, 17 Mar 2025 09:44:38 UTC |
|   67d7ef06a074d058f7e4c3a3  | succeeded (0s elapsed)  | check_step      | core.local       | Mon, 17 Mar 2025 09:44:38 UTC |
|   67d7ef06a074d058f7e4c3b3  | succeeded (16s elapsed) | sleep_action    | core.local       | Mon, 17 Mar 2025 09:44:38 UTC |
| + 67d7ef06a074d058f7e4c3b9  | succeeded (42s elapsed) | sleep_wf        | playpen.sleep_wf | Mon, 17 Mar 2025 09:44:38 UTC |
|    67d7ef07a074d058f7e4c3cb | succeeded (41s elapsed) | wf_sleep_action | core.local       | Mon, 17 Mar 2025 09:44:39 UTC |
|   67d7ef16a074d058f7e4c3dc  | succeeded (0s elapsed)  | check_step      | core.local       | Mon, 17 Mar 2025 09:44:54 UTC |
|   67d7ef16a074d058f7e4c3ec  | succeeded (14s elapsed) | sleep_action    | core.local       | Mon, 17 Mar 2025 09:44:54 UTC |
|   67d7ef24a074d058f7e4c3fc  | succeeded (0s elapsed)  | check_step      | core.local       | Mon, 17 Mar 2025 09:45:08 UTC |
|   67d7ef24a074d058f7e4c40c  | succeeded (0s elapsed)  | exit_point      | core.noop        | Mon, 17 Mar 2025 09:45:08 UTC |
|   67d7ef24a074d058f7e4c412  | succeeded (16s elapsed) | sleep_action    | core.local       | Mon, 17 Mar 2025 09:45:08 UTC |
|   67d7ef30a074d058f7e4c434  | succeeded (0s elapsed)  | check_step      | core.local       | Mon, 17 Mar 2025 09:45:20 UTC |
|   67d7ef30a074d058f7e4c444  | succeeded (14s elapsed) | sleep_action    | core.local       | Mon, 17 Mar 2025 09:45:20 UTC |
|   67d7ef34a074d058f7e4c454  | succeeded (0s elapsed)  | check_step      | core.local       | Mon, 17 Mar 2025 09:45:24 UTC |
|   67d7ef34a074d058f7e4c464  | succeeded (0s elapsed)  | exit_point      | core.noop        | Mon, 17 Mar 2025 09:45:24 UTC |
+-----------------------------+-------------------------+-----------------+------------------+-------------------------------+

@Dupangel
Author
Dupangel commented Mar 17, 2025

It means that one of your sleep_action executions never ran. Looking at the workflow file, sleep_action should execute 5 times (once for step 1, then twice each for steps 2 and 3 because of the two branches). Here I see that only 4 of them are executed, and st2 returns while one is still running.

That's the point: I suppose that Orquesta ends the workflow too early because one branch (and here an action) is still running when the first one has finished. As a result, some actions expected to run on the second branch never execute and the workflow is not fully completed from the user's point of view.

@nzlosh
Contributor
nzlosh commented Mar 19, 2025

I see what you mean about the expected execution of 5 sleep_action steps. I dug deeper into the Orquesta code to see what's going on. It uses networkx under the hood to build a multi-edge directed graph. When the workflow YAML is processed, the following multiset of nodes/vertices and edges is produced:

NODES/VERTICES:

 entrypoint
 check_step
 sleep_action
 sleep_wf

EDGES:

 (entrypoint, check_step)
 (check_step, sleep_action)
 (check_step, sleep_action)
 (check_step, sleep_action)
 (check_step, sleep_wf)
 (sleep_action, check_step)
 (sleep_wf, check_step)

As I understand it (I may be wrong), the edges correspond to a linear evaluation of the workflow, where the first evaluation of check_step creates a pair of edges for sleep_action and sleep_wf. When the edges return to check_step they are merged into a single branch (similar to the concept of a barrier) and the notion of parallel branches is lost. I don't know if this is by design. The graph corresponds with the execution behaviour, with the exception of the additional sleep_action step left in the RUNNING state, which is explained by this comment in the code:

        # self._graph is the graph model for the workflow. The tracking of workflow and task
        # progress and state is separate from the graph model. There are use cases where tasks
        # may be cycled and states overwritten.

As a workaround, I suggest not using the pattern of returning to the same node (check_step) once the workflow has branched, but creating two distinct paths so the edges can be resolved correctly (see the sketch below). I know this isn't optimal from the workflow point of view, but currently Orquesta doesn't support this type of parallel loop pattern.

I'll keep investigating, but I'm not sure I'll have enough free time to reach a full resolution of this issue.
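
A rough sketch of that workaround applied to this workflow, giving each branch its own copy of the loop (the _a/_b task names and counters are mine, and I haven't verified that this fully avoids the problem):

version: 1.0

description: sketch of the two-distinct-paths workaround (task names are illustrative)

vars:
  - count_a: 0
  - count_b: 0

tasks:
  entrypoint:
    action: core.noop
    next:
      - do:
          - sleep_action_a
          - sleep_action_b

  # branch A loops through its own check/sleep pair, so its cycle never
  # shares a node with branch B
  sleep_action_a:
    action: core.local
    input:
      cmd: "sleep 15"
    next:
      - publish:
          - count_a: <% ctx(count_a) + 1 %>
        do: check_step_a

  check_step_a:
    action: core.noop
    next:
      - when: <% ctx(count_a) < 3 %>
        do: sleep_action_a

  # branch B mirrors branch A with its own task names and counter
  sleep_action_b:
    action: core.local
    input:
      cmd: "sleep 42"
    next:
      - publish:
          - count_b: <% ctx(count_b) + 1 %>
        do: check_step_b

  check_step_b:
    action: core.noop
    next:
      - when: <% ctx(count_b) < 3 %>
        do: sleep_action_b

output:
  - message: "toto"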

@Dupangel
Author

I also dug a bit into Orquesta this morning ^^'
I tried to reproduce the graph that the engine builds with networkx. Here's what st2 outputs:

{
   'directed': True,
   'multigraph': True,
   'graph': [],
   'nodes': [
      {'id': 'entrypoint'},
      {'id': 'check_step'},
      {'id': 'sleep_action'},
      {'id': 'sleep_wf'}
   ],
   'adjacency':
      [
         [
            {
               'criteria': [],
               'ref': 0,
               'id': 'check_step',
               'key': 0
            }
         ],
         [
            {
               'criteria': ['<% ctx(nextstep) = "step1" %>'],
               'ref': 0,
               'id': 'sleep_action',
               'key': 0
            },
            {
               'criteria': ['<% ctx(nextstep) = "step2" %>'],
               'ref': 1,
               'id': 'sleep_action',
               'key': 1
            },
            {
               'criteria': ['<% ctx(nextstep) = "step3" %>'],
               'ref': 2,
               'id': 'sleep_action',
               'key': 2
            },
            {
               'criteria': ['<% ctx(nextstep) = "step1" %>'],
               'ref': 0,
               'id': 'sleep_wf',
               'key': 0
            }
         ],
         [
            {
               'criteria': [],
               'ref': 0,
               'id': 'check_step',
               'key': 0
            }
         ],
         [
            {
               'criteria': [],
               'ref': 0,
               'id': 'check_step',
               'key': 0
            }
         ]
      ]
   }

However, it seems that a MultiDiGraph can support parallel edges, which is why I'm a bit surprised that Orquesta loses this feature.

Found on the NetworkX documentation website:

MultiDiGraph - Directed graphs with self loops and parallel edges
Overview

MultiDiGraph(data=None, **attr)

    A directed graph class that can store multiedges.

    Multiedges are multiple edges between two nodes. Each edge can hold optional data or attributes.

    A MultiDiGraph holds directed edges. Self loops are allowed.

    Nodes can be arbitrary (hashable) Python objects with optional key/value attributes.

    Edges are represented as links between nodes with optional key/value attributes.

As a beginner with the Orquesta engine, I didn't find where this kind of join/barrier is done.
However, I found that because check_step is a split task that is also in a cycle, Orquesta doesn't keep track of the split:

>>> wf_spec.tasks.is_split_task("check_step")
True
>>> wf_spec.tasks.in_cycle("check_step")
True

This is because of the following check on line 64:

# Determine if the task is a split task and if it is in a cycle. If the task is a
# split task, keep track of where the split(s) occurs.
if wf_spec.tasks.is_split_task(task_name) and not wf_spec.tasks.in_cycle(task_name):
    splits.append(task_name)

if splits:
    wf_graph.update_task(task_name, splits=splits)

To me, it seems that Orquesta doesn't want a cycle on a split task (maybe to limit infinite parallel branch creation).

@cognifloyd
Member
cognifloyd commented Mar 22, 2025

I think I would use sub-workflows for something like your example workflow: a task that explicitly starts 2 branches (next: [do: [branch_a, branch_b]]) going to two tasks (which can probably even be the same task, next: [do: [sub_workflow, sub_workflow]]) that start the same sub-workflow action. Then do whatever looping you need within that sub-workflow. That way the 2 branches will be isolated enough to avoid mangling each other's context (see the sketch below).

Note: this is quirky behavior in Orquesta. It would be nice to fix this and other issues at some point. For now, hopefully this gives you a way to do what you need with Orquesta as it is today.
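
A rough sketch of that layout, assuming the looping logic (the original check_step/sleep_action cycle) is moved into its own sub-workflow; toto.loop_sub_workflow is a hypothetical action name:

version: 1.0

description: sketch of the sub-workflow pattern (action name is hypothetical)

tasks:
  entrypoint:
    action: core.noop
    next:
      - do:
          - branch_a
          - branch_b

  # each branch runs the looping logic inside its own sub-workflow execution,
  # so each loop has its own isolated context
  branch_a:
    action: toto.loop_sub_workflow

  branch_b:
    action: toto.loop_sub_workflow

output:
  - message: "toto"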
