[DRMAA-WG] TERMINATED vs. FAILED - reloaded

Peter Tröger peter at troeger.eu
Tue May 5 15:49:02 CDT 2009


Dear all,

The March 31st conference call decided on the following strategy 
regarding the job state model extension:

--- snip

> 4. TERMINATED vs. FAILED state discussion:
> http://www.ogf.org/pipermail/drmaa-wg/2009-March/001012.html

Option 2 from the original mail is now highly preferred. TERMINATED
state should express that an external entity (e.g. user or DRM system)
stopped the job before finishing. For POSIX-aligned systems, this
could be formulated as reception of a signal by "the job". In
contrast, FAILED state now expresses that the application stopped on
its own before finishing. For POSIX-aligned systems, this could be
formulated as reception of a signal "by the job's application process".

We ask for comments from PBS and LSF experts (FedStage ?!?). Do these
systems provide enough error information to distinguish between these
two states? For SGE and Condor, Dan and Peter already agreed.

--- snip

Piotr from FedStage informed me that the proposed distinction does not 
seem to be implementable in PBS. One solution could be to detect the 
'requested' termination only in the DRMAA library. Dan already pointed 
out that this would not reflect the original idea. An intentional job 
termination by another user would then lead to FAILED instead of TERMINATED.
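
To make the library-side workaround concrete, here is a minimal sketch. 
It is not DRMAA API: the names SessionTracker, request_termination and 
classify_abnormal_end are hypothetical, and the sketch assumes the 
library can only remember termination requests issued through its own 
session.

```python
class SessionTracker:
    """Hypothetical library-side bookkeeping; not part of any DRMAA binding."""

    def __init__(self):
        # Job IDs for which this session itself requested termination.
        self._terminate_requested = set()

    def request_termination(self, job_id):
        # A real library would also forward the request to the DRM system here.
        self._terminate_requested.add(job_id)

    def classify_abnormal_end(self, job_id):
        # The library only sees its own requests, so any other abnormal end
        # is reported as FAILED, including a kill issued by someone else.
        if job_id in self._terminate_requested:
            return "TERMINATED"
        return "FAILED"


tracker = SessionTracker()
tracker.request_termination("job-1")  # terminated through this session
# "job-2" is assumed to have been killed by another user, outside this session.
print(tracker.classify_abnormal_end("job-1"))
print(tracker.classify_abnormal_end("job-2"))
```

In this sketch "job-2" ends up as FAILED although it was stopped on 
purpose, which is the mismatch with the original idea described above.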

Since we already rejected Options 1 and 3 in the last phone calls, we 
are left with Option 4 as the last solution: there will be no new 
TERMINATED state. The new job sub-state concept will allow 
implementations to express job failure details, but only in a 
DRM-specific way.
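
As a rough illustration of Option 4, the following sketch models a job 
status with a main state and an opaque sub-state. Everything here is an 
assumption made for illustration: the JobState subset, the JobStatus 
type, the is_failure helper and the sub-state strings are made up and 
are not taken from the DRMAA specification or from any DRM system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class JobState(Enum):
    # Illustrative subset only; there is deliberately no TERMINATED member.
    RUNNING = "RUNNING"
    DONE = "DONE"
    FAILED = "FAILED"


@dataclass(frozen=True)
class JobStatus:
    state: JobState
    # Opaque, DRM-specific detail; portable code should not interpret it.
    sub_state: Optional[str] = None


def is_failure(status):
    # Portable check: relies on the main state only.
    return status.state is JobState.FAILED


# Made-up sub-state strings, just for illustration.
killed = JobStatus(JobState.FAILED, sub_state="killed-by-operator")
crashed = JobStatus(JobState.FAILED, sub_state="application-abort")
print(is_failure(killed), is_failure(crashed))
```

Portable applications would see FAILED in both cases, and only code 
written for one specific DRM system could tell the two apart via the 
sub-state.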

We will take the final vote on this in the next call.

Best regards,
Peter.



