[DRMAA-WG] Conference call -Feb 3rd - 17:00 UTC

Daniel Templeton Dan.Templeton at Sun.COM
Tue Feb 3 07:54:42 CST 2009



Andre Merzky wrote:
> Quoting [Daniel Templeton] (Feb 02 2009):
>   
>> Since I won't make the meeting, here's my feedback.
>>
>> Peter Tröger wrote:
>>
>> To represent the temporarily undetermined state, we expand the 
>> TryAgainLaterException to apply to drmaa_job_ps() as well.
>>
>>     
>>> 3. Voting about separate "TERMINATED" vs. "FAILED" state
>>>    - Semantics
>>>  
>>>       
>> A job that exits via the terminated state has the potential to succeed 
>> if resubmitted.  It entered the terminated state due to an action taken 
>> by the job owner, an administrator, or the DRM system itself, possibly 
>> on behalf of the terminated job.  A job that exits via the failed state 
>> is unlikely to succeed if resubmitted.  It entered the failed state due 
>> to an error in the job or a misconfiguration of the machine on which it ran.
>>     
>
> You can't always know if a resubmit will yield a chance of
> success :  a broken file system, or insufficient or bad
> memory, may lead to internal fail states, and may well allow
> the job to succeed next time.  An endless loop in the
> application may always occur, and trigger the scheduler or
> the system to eventually kill the job, without any chance of
> a later instance to do any better.
>
> So, maybe it is better to distinguish based on the
> information you _do_ have?
>
>   - FAILED:     the job terminated for internal reasons (i.e.
>                 the application met an internal error condition)
>
>   - TERMINATED: the job termination was triggered by an
>                 external entity (e.g. by the user, scheduler, system, ...)
>   

Yep.  I agree.  I talked myself out of my overly simple semantics in the 
following paragraph.
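
A minimal Java sketch of that rule, assuming the implementation knows who triggered the end of the job. The names EndCause, FinalState and FinalStateRule are hypothetical and not part of any DRMAA binding:

```java
// Hypothetical names for illustration only; none of these come from the DRMAA spec.
enum EndCause { APPLICATION_ERROR, USER, SCHEDULER, SYSTEM }

enum FinalState { FAILED, TERMINATED }

final class FinalStateRule {
    // FAILED: the application itself hit an internal error condition.
    // TERMINATED: an external entity (user, scheduler, system) ended the job.
    static FinalState classify(EndCause cause) {
        return cause == EndCause.APPLICATION_ERROR
                ? FinalState.FAILED
                : FinalState.TERMINATED;
    }
}
```

The rule makes no claim about whether a resubmit would succeed; it only records the information the DRM actually has.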

>   
>> There is a problem with my clean could-succeed/won't-succeed division.  
>> What if a job failed because the machine it ran on was wonky?  That is 
>> clearly a failure, not a termination, but if the job were resubmitted 
>> and landed on any other machine, it would succeed.  In that case, do we 
>> actually care if there was a difference between failure and termination?
>>
>>     
>>>    - Resulting new job state transitions
>>>  
>>>       
>> There's one more thing we may want to consider.  In SGE, a job can exit 
>> one of four ways.  It can succeed.  It can fail, which includes 
>> termination.  It can request to be rescheduled.  And it can be set into 
>> error state.  The first two are handled fine by drmaa_wait().  The third 
>> can be recognized by drmaa_job_ps(), but it's not ideal.  The fourth is 
>> completely unknowable from DRMAA.  To the DRMAA client, it will look 
>> like the job was requeued to be rescheduled, but is never actually 
>> scheduled to run again.  We might want to consider supporting some 
>> additional states, such as rescheduled or error, or maybe those states 
>> are something that the state/substate model would enable.
>>
>> I vote for making the substate as generic as possible.  I think forcing 
>> it to be an integer is unnecessarily limiting.  Taking some Java APIs as 
>> examples, sometimes the substates are really just text messages that 
>> explain what's going on.  I think that's valid and something we should 
>> allow.
>>     
>
> "If the only tool you have is a hammer, every problem starts
> to look like a nail."  So, my apologies for pulling the same
> string every time I post to this list *blush*
>
> Anyway, you may want to have a look at the SAGA state model,
> again: substates are defined as strings, but SAGA
> implementations are encouraged to define these strings, and to
> adhere to a namespace.  So, an SGE implementation would
> document the substates of RUNNING as
>
>   SGE:RESCHEDULED
>   SGE:ERROR 
>   
> Well, SGE:ERROR should go into a final state, not into
> RUNNING, right?  But you got the picture. (GFD-90 p.65, last
> paragraph).
>
> Cheers, Andre.
>   

Maybe it's just me, but I have a fundamental problem with string parsing 
to determine call results.  There is no case where an enumeration isn't 
cleaner, clearer, and safer.  I would propose making the substate be a 
void pointer (or Object), so that the implementation can pass back 
whatever it wants.  If you're reading the substates, it's DRM specific 
already, so no reason to avoid forcing the person to cast the substate 
to something DRM specific.
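
A rough Java sketch of what that could look like. Everything here is an assumption for illustration: JobState, SgeSubstate, JobStatus and SubstateDemo are hypothetical names, not taken from the DRMAA spec or from any SGE binding.

```java
// Hypothetical types for illustration; none of these names come from the DRMAA spec.
enum JobState { RUNNING, DONE, FAILED }

// A DRM-specific substate enumeration, as an SGE implementation might define it.
enum SgeSubstate { RESCHEDULED, ERROR }

final class JobStatus {
    private final JobState state;
    private final Object substate; // opaque to generic DRMAA clients

    JobStatus(JobState state, Object substate) {
        this.state = state;
        this.substate = substate;
    }

    JobState getState() { return state; }
    Object getSubstate() { return substate; }
}

final class SubstateDemo {
    // DRM-aware client code casts the opaque substate; no string parsing involved.
    static String describe(JobStatus status) {
        Object sub = status.getSubstate();
        if (sub instanceof SgeSubstate) {
            switch ((SgeSubstate) sub) {
                case RESCHEDULED: return "requeued, waiting to be scheduled again";
                case ERROR:       return "held in error state, will not run again";
            }
        }
        // Generic clients, or unknown substate types, fall back to the main state.
        return "no DRM-specific detail for state " + status.getState();
    }

    public static void main(String[] args) {
        System.out.println(describe(new JobStatus(JobState.RUNNING, SgeSubstate.RESCHEDULED)));
    }
}
```

A client that does not know the DRM simply ignores the substate and only looks at getState().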

Daniel

>>> 4. Further DRMAA2 discussion
>>>  
>>>       
>> See the attached email from a few weeks ago.
>>
>> Daniel
>>     
>
>   
>> Date: Tue, 20 Jan 2009 08:46:24 -0800
>> From: Daniel Templeton <Dan.Templeton at Sun.COM>
>> Subject: DRMAA v2
>> To: DRMAA Working Group <drmaa-wg at gridforum.org>
>>
>> A few proposals for the meeting today:
>>
>> PT12:
>> < A language binding SHOULD specify numeric values for all DRMAA error 
>> constants.
>> ---
>> > Such a language binding SHOULD specify numeric values for all DRMAA 
>> > error constants.
>>
>> PT13:
>> I definitely agree that PartialTimestamp is a boondoggle.  I'm not sure 
>> I agree with using ISO8601, though, mostly because it presupposes a 
>> date/time *string*.  In a high order language, I want to be able to use 
>> the native date/time object.  How about specifying that a language 
>> should use a date/time object or primitive if it has one, and an ISO8601 
>> string if it doesn't?
>>
>> PT20:
>> I think we can handle the resource request pretty easily, and I think we 
>> need it.  We just need to add a resourceRequest attribute of type 
>> Dictionary and treat any such resource request as a hard request.  
>> Alternatively, we could have a hardResourceRequest and a 
>> softResourceRequest.  The former is simpler, but the latter saves us from 
>> talking about this again for DRMAAv3. :)
>>
>> Thinking about whether a resource request should be an optional 
>> attribute has created in me a doubt about the value of the 
>> UnsupportedAttributeException.  Should it be possible to have the 
>> implementation just ignore unsupported optional attributes?  It would 
>> certainly be easier than repeatedly attempting to submit until all the 
>> offending attributes are removed from the template.  Maybe it would help 
>> to have the exception detail *all* unsupported attributes at once.  Just 
>> thinking out loud here...
>>
>> Daniel
>>     
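
The idea of reporting *all* unsupported attributes at once could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: UnsupportedAttributesException and TemplateValidator are hypothetical names, and the job template is modelled as a plain Map rather than the real DRMAA JobTemplate.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical exception that carries every rejected attribute name, not just the first.
class UnsupportedAttributesException extends Exception {
    private final Set<String> attributeNames;

    UnsupportedAttributesException(Set<String> attributeNames) {
        super("Unsupported attributes: " + attributeNames);
        this.attributeNames = Collections.unmodifiableSet(new TreeSet<>(attributeNames));
    }

    Set<String> getAttributeNames() { return attributeNames; }
}

// Hypothetical validator; the template is modelled as a simple attribute map.
final class TemplateValidator {
    private final Set<String> supported;

    TemplateValidator(Set<String> supported) {
        this.supported = new HashSet<>(supported);
    }

    // Collects all unsupported attribute names before throwing a single exception.
    void validate(Map<String, ?> template) throws UnsupportedAttributesException {
        Set<String> rejected = new TreeSet<>(template.keySet());
        rejected.removeAll(supported);
        if (!rejected.isEmpty()) {
            throw new UnsupportedAttributesException(rejected);
        }
    }
}
```

With this shape, a client can strip every offending attribute after one failed submit instead of retrying once per attribute.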

