[DRMAA-WG] wifexited and wifsignalled confusion continues

Piotr Domagalski piotr.domagalski at fedstage.com
Wed Nov 12 10:04:17 CST 2008


Hi Roger,

On Wed, Nov 12, 2008 at 4:26 PM, Roger Brobst <rogerb at cadence.com> wrote:
> I believe during a drmaa teleconf (over a year ago)
> it was agreed that the single testcase which validates
> a wide range of exit codes should be split into two
> testcases (one for below 128, the other for above).
> I haven't had an opportunity to dig through the archives
> to substantiate my recollection.

It would be great to have that. I would prefer our implementation to
pass the test suite smoothly so I'd probably vote only for testing
0-128 ;-)

Anyway, even that minor change of splitting it into two tests would be great.

> I think the suggestion to handle 126 and 127 specially
> deserves additional discussion ... but introduces its
> own issues:
>
> If the command is a shell script like:
>    #!/bin/sh
>    sleep 30  # or solve the world's problems
>    exec /some/nonExistant/program
>
> I would expect the shell to exit with status 126
> (because /some/nonExistant/program was not found).
>
> It would be incorrect for the parent of the shell
> to interpret this as 'job never started' since the
> shell could perform any number of tasks before the
> failed exec.

Yes, that's true. To sum up, we need to be aware of two different cases:

- The DRM doesn't use a shell to start the executable you specify in
DRMAA_REMOTE_COMMAND. That's the case for SGE, for example.

When you tell DRMAA to start a non-existing program, you get
DRMAA_PS_FAILED + aborted = true.
When you tell DRMAA to start the above script, you get DRMAA_PS_DONE +
exited = true + exitstatus = 126.

It's possible to completely tell these two cases apart.
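
To illustrate why the two cases can be told apart when no shell is
involved, here is a minimal Python sketch. It is not DRMAA code: it only
mimics a launcher that calls exec directly, and the name start_directly
and the "FAILED"/"DONE"/"SIGNALED" labels are made up for illustration.
The script from the quoted message is shortened (no sleep) and passed to
/bin/sh inline, which is assumed to exist.

```python
import subprocess

MISSING = "/some/nonexistent/program"
# Stand-in for the quoted script: do some work, then a failing exec.
SCRIPT = "true; exec " + MISSING

def start_directly(argv):
    """Start argv with no intermediate shell (hypothetical launcher)."""
    try:
        proc = subprocess.run(argv, stderr=subprocess.DEVNULL)
    except OSError:
        # exec itself failed: the job never started.
        return ("FAILED", None)
    if proc.returncode < 0:
        # subprocess reports death by signal N as -N.
        return ("SIGNALED", -proc.returncode)
    return ("DONE", proc.returncode)

if __name__ == "__main__":
    print(start_directly([MISSING]))
    print(start_directly(["/bin/sh", "-c", SCRIPT]))
```

The first call fails before any process runs, which corresponds to
DRMAA_PS_FAILED + aborted = true; the second one starts, runs, and only
then exits with the shell's "could not exec" status.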

- The DRM does use a shell to start the executable you specify in
DRMAA_REMOTE_COMMAND.

When you tell DRMAA to start a non-existing program, you internally
get an exit status of 126.
When you tell DRMAA to start the above script, you internally get an
exit status of 126.

Internally, these two cases look exactly the same to a DRMAA
implementer, so she has to decide whether to leave them as they are
or to *always* interpret exit codes 126 and 127 as DRMAA_PS_FAILED
+ aborted = true.
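
A minimal Python sketch of the ambiguity, assuming a launcher that wraps
every command in "/bin/sh -c" (the helper name start_via_shell is
hypothetical, and the quoted script is shortened to drop the sleep).
Note that POSIX shells conventionally report 127 for "command not found"
and 126 for "found but not executable", so the exact value may differ
from the 126 mentioned in this thread; the point is only that both jobs
yield the same status.

```python
import shlex
import subprocess

MISSING = "/some/nonexistent/program"
# Stand-in for the quoted script: do some work, then a failing exec.
SCRIPT = "true; exec " + MISSING

def start_via_shell(argv):
    """Run argv through an intermediate /bin/sh, return its exit status."""
    command_line = shlex.join(argv)  # quote each argument for the shell
    proc = subprocess.run(["/bin/sh", "-c", command_line],
                          stderr=subprocess.DEVNULL)
    return proc.returncode

if __name__ == "__main__":
    never_started = start_via_shell([MISSING])
    ran_then_failed = start_via_shell(["/bin/sh", "-c", SCRIPT])
    # Both jobs report the same status to the launcher.
    print(never_started, ran_then_failed)
```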


I tend to agree that it would be much safer to leave it as it is --
i.e. to return DRMAA_PS_DONE and exitstatus = 126/127 on systems
that use a shell to start the program (LSF, in our case). The
interpretation, whether the code was returned because the main program
(DRMAA_REMOTE_COMMAND) was not found or because the program itself
returned it (explicitly, or because it was something like the above
script), should be left to the end user.
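
As a rough idea of what such end-user interpretation might look like,
here is a small Python sketch. The function name and the wording of the
hints are made up for illustration; the 126/127 and 128+N meanings are
common shell conventions, not something DRMAA guarantees.

```python
def interpret_exit_status(exit_status):
    """Return a hint for a job reported as done with this exit status."""
    if exit_status == 0:
        return "success"
    if exit_status == 126:
        return "ambiguous: command not executable, or job exited with 126"
    if exit_status == 127:
        return "ambiguous: command not found, or job exited with 127"
    if exit_status > 128:
        # Shells commonly report death by signal N as 128 + N.
        return "possibly killed by signal %d" % (exit_status - 128)
    return "job-defined failure"

if __name__ == "__main__":
    for status in (0, 1, 126, 127, 137):
        print(status, interpret_exit_status(status))
```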

-- 
Piotr Domagalski

