[DRMAA-WG] Conference call - Feb 3rd - 17:00 UTC

Daniel Templeton Dan.Templeton at Sun.COM
Mon Feb 2 16:13:32 CST 2009


Since I won't make the meeting, here's my feedback.

Peter Tröger wrote:
> 2. Voting about "UNDETERMINED" job state
>     - keep it as own job state ?
>   

Yes.  Undetermined stays as a state, but it is redefined to mean 
permanently undetermined.  Trying again later will yield the same result.

>     - Means permanent or temporary problem ?
>   

To represent the temporarily undetermined state, we expand the 
TryAgainLaterException to apply to drmaa_job_ps() as well.
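
A rough Python sketch of that split, to make the intent concrete. It does not use any real DRMAA binding: JobState, TryAgainLater, poll_state and make_flaky_query are hypothetical names invented for this illustration. A temporary problem surfaces as an exception the caller can retry, while UNDETERMINED comes back as an ordinary, final answer.

```python
import enum
import time


class JobState(enum.Enum):
    RUNNING = "running"
    DONE = "done"
    UNDETERMINED = "undetermined"  # permanent: asking again will not help


class TryAgainLater(Exception):
    """Hypothetical stand-in for a temporary failure to read the job state."""


def poll_state(query_state, job_id, attempts=3, delay=0.0):
    """Call query_state(job_id), retrying only on TryAgainLater.

    UNDETERMINED is returned as-is, because under the proposed
    definition a retry would yield the same result.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return query_state(job_id)
        except TryAgainLater:
            if attempt == attempts - 1:
                raise  # still temporary after all attempts; caller decides
            time.sleep(delay)


def make_flaky_query(failures, final_state):
    """Build a fake query that raises TryAgainLater `failures` times first."""
    remaining = [failures]

    def query(job_id):
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TryAgainLater(job_id)
        return final_state

    return query
```

With this shape a client never has to guess whether UNDETERMINED is worth polling again: only the exception path is retried.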

> 3. Voting about separate "TERMINATED" vs. "FAILED" state
>     - Semantics
>   

A job that exits via the terminated state has the potential to succeed 
if resubmitted.  It entered the terminated state due to an action taken 
by the job owner, an administrator, or the DRM system itself, possibly 
on behalf of the terminated job.  A job that exits via the failed state 
is unlikely to succeed if resubmitted.  It entered the failed state due 
to an error in the job or a misconfiguration of the machine on which it ran.

There is a problem with my clean could-succeed/won't-succeed division.  
What if a job failed because the machine it ran on was wonky?  That is 
clearly a failure, not a termination, but if the job were resubmitted 
and landed on any other machine, it would succeed.  In that case, do we 
actually care if there was a difference between failure and termination?
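
To show where the division gets blurry, here is a small Python sketch of a client deciding whether to resubmit. Everything in it is an assumption made for illustration: ExitState and worth_resubmitting are invented names, and bad_host_suspected is a hypothetical hint that nothing in DRMAA currently supplies.

```python
import enum


class ExitState(enum.Enum):
    DONE = "done"
    TERMINATED = "terminated"  # stopped by owner, admin, or the DRM system
    FAILED = "failed"          # job error or misconfigured execution machine


def worth_resubmitting(state, bad_host_suspected=False):
    """Return True if resubmitting has a reasonable chance of success.

    bad_host_suspected models the "wonky machine" case, where a FAILED
    job could still succeed if it landed on a different machine.
    """
    if state is ExitState.DONE:
        return False  # already succeeded, nothing to redo
    if state is ExitState.TERMINATED:
        return True   # could succeed if resubmitted
    return bad_host_suspected  # FAILED: only worth it if the host was the cause
```

The extra flag is the awkward part: once a client needs it, the failed/terminated distinction alone no longer answers the resubmission question.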

>     - Resulting new job state transitions
>   

There's one more thing we may want to consider.  In SGE, a job can exit 
one of four ways.  It can succeed.  It can fail, which includes 
termination.  It can request to be rescheduled.  And it can be set into 
error state.  The first two are handled fine by drmaa_wait().  The third 
can be recognized by drmaa_job_ps(), but it's not ideal.  The fourth is 
completely unknowable from DRMAA.  To the DRMAA client, it will look 
like the job was requeued to be rescheduled, but is never actually 
scheduled to run again.  We might want to consider supporting some 
additional states, such as rescheduled or error, or maybe those states 
are something that the state/substate model would enable.

I vote for making the substate as generic as possible.  I think forcing 
it to be an integer is unnecessarily limiting.  Taking some Java APIs as 
examples, sometimes the substates are really just text messages that 
explain what's going on.  I think that's valid and something we should 
allow.
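
As a minimal sketch of what a generic substate could look like, here is a Python model in which the substate may be an integer, a text message, or absent. JobStatus and describe are hypothetical names, and the state strings used below are placeholders rather than proposed DRMAA states.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

S = TypeVar("S")  # substate type is left open: int, str, or anything else


@dataclass(frozen=True)
class JobStatus(Generic[S]):
    state: str                    # coarse state, common to all DRM systems
    substate: Optional[S] = None  # DRM-specific detail, if any

    def describe(self):
        """Render the state, appending the substate when one is present."""
        if self.substate is None:
            return self.state
        return f"{self.state} ({self.substate})"
```

For example, JobStatus("queued", "rescheduled at the job's request") and JobStatus("failed", 137) are both valid here, which is the flexibility an integer-only substate would rule out.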

> 4. Further DRMAA2 discussion
>   

See the attached email from a few weeks ago.

Daniel
-------------- next part --------------
An embedded message was scrubbed...
From: Daniel Templeton <Dan.Templeton at Sun.COM>
Subject: DRMAA v2
Date: Tue, 20 Jan 2009 08:46:24 -0800
Size: 1905
Url: http://www.ogf.org/pipermail/drmaa-wg/attachments/20090202/4cb60fc6/attachment.mht 

