[DRMAA-WG] DRMAA2: TERMINATED vs. FAILED state

Daniel Templeton Dan.Templeton at Sun.COM
Tue Mar 17 09:20:58 CDT 2009


I think the ultimate purpose here is to be able to tell when a job was 
killed by a user, an administrator, a forced migration, or some similar 
external cause.  The internal/external distinction (Option 2) captures 
that best for me.  I think the other subtleties of how exactly a job 
failed should be expressed another way, such as through the substate 
information.
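
A minimal sketch of the internal/external split, with the finer detail 
pushed into a substate.  All names here (JobState, FinalStatus, 
classify, and the cause strings) are hypothetical illustrations and are 
not taken from any DRMAA specification or binding:

```python
from dataclasses import dataclass
from enum import Enum


class JobState(Enum):
    DONE = "DONE"
    FAILED = "FAILED"          # job ended unsuccessfully on its own
    TERMINATED = "TERMINATED"  # job was ended by an external entity


# Hypothetical cause labels; a real DRM system would report its own.
EXTERNAL_CAUSES = {"user_kill", "admin_kill", "forced_migration"}


@dataclass(frozen=True)
class FinalStatus:
    state: JobState
    substate: str  # free-form detail about how exactly the job ended


def classify(exit_ok: bool, cause: str = "") -> FinalStatus:
    """Map an end-of-job report onto a final state plus substate."""
    if cause in EXTERNAL_CAUSES:
        # Externally triggered end: TERMINATED, regardless of exit status.
        return FinalStatus(JobState.TERMINATED, cause)
    if exit_ok:
        return FinalStatus(JobState.DONE, cause)
    # The job ended unsuccessfully by itself; the cause goes to the substate.
    return FinalStatus(JobState.FAILED, cause)
```

With this split, classify(False, "user_kill") would yield TERMINATED, 
while classify(False, "segfault") would yield FAILED with "segfault" 
kept only as substate detail.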

Daniel

Peter Tröger wrote:
> Dear all,
>
> This thread is intended to finalize the discussion about job states 
> after the end of execution in DRMAA2.
> In DRMAA1, there is only the FAILED state, expressing that the job was 
> running but did not finish successfully for some reason. Piotr proposed 
> a separation between FAILED and TERMINATED jobs:
>
> http://www.ogf.org/pipermail/drmaa-wg/2009-January/000985.html
>
> Meanwhile, we have had several proposals regarding this idea:
>
> Option 1)
> TERMINATED state = resubmission might help,
> FAILED state = resubmission unlikely to help (machine problem, 
> misconfiguration)
>
> Option 2)
> TERMINATED state = triggered by an external entity,
> FAILED state = job terminated by itself
>
> Option 3)
> FAILED state = job command line could not be executed
> TERMINATED state = something else happened
>
> Option 4)
> Stick with FAILED only, and express special circumstances via the new 
> job sub-state information
>
> Issue #5875 (originally from the PBS experience report) criticizes that 
> FAILED currently expresses both user-requested termination and job 
> failure. How is this issue related to the problem?
>
> Another question is the relation to the wif_* functions.
>
> Please contribute your opinion.
>
> Thanks,
> Peter.

