background job PIDs and PGIDs · Issue #2636 · cylc/cylc-flow · GitHub
background job PIDs and PGIDs #2636
Open
@hjoliver

Description

From an email conversation with @matthewrmshin:

For background jobs,

  • with no execution time limit, the batch system job ID is the job PID
  • with an execution time limit, the batch system job ID is the PID of the timeout process, not the job PID

The job status file correctly records both PIDs, but the GUI shows the timeout PID.

I wonder if it might be best to show the job PID in the GUI, because it's probably natural to assume that for background jobs the "batch system job ID" would be the job PID, and therefore you should be able to kill the job manually with "kill PID". But if you do that, it just kills the timeout command and leaves the job running. On the other hand, if you kill the job ID, the timeout command dies as well. ("cylc kill" does the right thing in both cases, thankfully - although I expect we have tests to check that).

If I remember correctly, cylc kill uses os.killpg to kill a background job, so it should work correctly. Yes, it may even have an automated test in the test battery. Otherwise it would have been thoroughly tested during development. It was one of those things whose correctness I tried very hard to ensure.

We should document this behaviour and how to kill a process group properly on the command line. Killing a job script can leave its child processes orphaned otherwise.

Yes, os.killpg is used. The same can be done with "kill -- -PGID" (that is, kill with a negative process group ID) but I suspect the average user doesn't know about process groups and how to kill them. And besides we don't tell them (explicitly) what the process group ID is.
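
For illustration, here is a minimal Python sketch of signalling a whole process group with os.killpg, the call mentioned above. It is not cylc's code: the shell command is a hypothetical stand-in for a job script, and start_new_session=True is an assumption made here so that the stand-in leads its own process group.

```python
import os
import signal
import subprocess
import time


def kill_process_group(pid, sig=signal.SIGTERM):
    """Send sig to every process in the group that pid belongs to."""
    pgid = os.getpgid(pid)
    os.killpg(pgid, sig)
    return pgid


def demo():
    # Hypothetical stand-in for a job script: a shell with two children.
    # start_new_session=True (an assumption of this sketch) makes the shell
    # a process group leader, so its PGID equals its PID.
    proc = subprocess.Popen(
        ["sh", "-c", "sleep 60 & sleep 60 & wait"],
        start_new_session=True,
    )
    time.sleep(0.5)  # give the shell a moment to start its children
    pgid = kill_process_group(proc.pid)
    proc.wait()  # reap the shell; a negative returncode means "killed by signal"
    return proc.pid, pgid, proc.returncode
```

Calling demo() should return a PID and PGID that match, plus a negative return code showing that the shell was terminated by the signal rather than exiting on its own.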

Hence my question - for background jobs can we display the job PID instead of the timeout command PID in the GUI? Because - unlike currently - that will have the desired result if the user tries to manually kill the job with "kill" instead of "cylc kill".

To put it another way, is there any good reason why we show the timeout command PID as the "batch system job ID" for background jobs? If not, I suggest we show the real job PID. Otherwise, we should document what the job ID actually is, and how to kill a process group, or perhaps consider showing both the batch system job ID and the job PID in the GUI.

If I remember correctly, being able to kill the process group is important because the job script also launches child processes. Killing the process group will kill those child processes properly instead of leaving them as orphans. If we display the PID of the job script instead of the timeout command, users are able to kill the job script directly, but will be unable to kill the child processes in the process group. On the other hand, I can see what you mean as well.
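
The orphaning concern can be reproduced with a small Python sketch. Again this is only an illustration under assumptions, not cylc's implementation: the shell command is a hypothetical stand-in for a job script, and it is started in its own session so that its PID doubles as the process group ID.

```python
import os
import signal
import subprocess


def kill_leader_only():
    """Kill just the group leader and report whether its child survives."""
    # Hypothetical stand-in for a job script: a shell that backgrounds a
    # child and then waits on it.
    proc = subprocess.Popen(
        ["sh", "-c", "sleep 60 & echo started; wait"],
        stdout=subprocess.PIPE,
        start_new_session=True,  # new session and group, so PGID == proc.pid
    )
    pgid = proc.pid
    proc.stdout.readline()  # block until the shell has forked its child
    os.kill(proc.pid, signal.SIGTERM)  # plain "kill PID": the leader only
    proc.wait()
    try:
        os.killpg(pgid, 0)  # signal 0 only tests whether the group has members
    except ProcessLookupError:
        orphan_alive = False
    else:
        orphan_alive = True
        os.killpg(pgid, signal.SIGKILL)  # clean up the orphaned child
    proc.stdout.close()
    return orphan_alive
```

Here kill_leader_only() is expected to return True: the shell is gone, but the background sleep is still running in the same process group until the group itself is signalled.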

(Now that I have looked through the test battery, I am unable to find a test that tests the killing of a background job with execution time limit.)
