[drmaa-wg] Questions

Thu Apr 7 10:12:12 CDT 2005

On Thu, 7 Apr 2005, Peter Troeger wrote:

>
> >> The routine return code would need to indicate a compound error; BTW we
> >> do not have such error code defined, and the detailed error message
> >> would need to detail what happened.
> >
> >
> > In other words, the spec completely fails to address this case.
> > Something to keep in mind for 1.1 or 2.0.
>
> I added a tracker item for the issue.
>
> >>> When doing a drmaa_control(DRMAA_JOB_IDS_SESSION_ALL), what is the
> >>> contract on failure, i.e. in what state will the jobs be left?  In the
> >>> case of a job failure, does that mean that all jobs will be left in the
> >>> state that they were in before the call?  If so, that's going to cause
> >>> serious implementation problems.  If not, that's going to cause serious
> >>> usability problems.
> >>
> >> Transactional interface would be quite useful here ...
> >> If a routine exits/fails during the call there is no good recourse.
> >
> > Exactly the point I'm making.  Without transactions, it's hard to use.
> > With transactions, it's hard to implement.
>
> To demand a transactional behavior seems to me non-realistic. Most other
> groups (e.g. OGSA) have similar problems, take for example the
> SetResourceProperties operation in WS-ResourceProperties specification
> (chapter 7). The usual approach is to declare the problem as
> implementation-dependent.
>
> >>> (DRMAA_JOB_IDS_SESSION_ALL), but another thread "steals" the job exit
> >>> info with a call to drmaa_wait()?  I would assume that the synchronize
> >>> thread should just assume that the job finished, even though its job
> >>> record is gone.  That is what the SGE implementation does.
> >>
> >>
> >> Ha, races with job reaping info.  The developers would need to be
> >> careful in multithreaded environments ... some guidelines would be
> >> necessary, but preferably outside of the normative docs.
> >
> >
> > The reason I bring it up is that this particular case is non-obvious.
> > It's clear that waiting for the same job twice is bad, but it's not so
> > clear when waiting for any or all.
>
> The result seems to be that we need more clarification about
> multithreading issues in the spec. Is it worthwhile to open a tracker
> item for this, in order to collect all the specific findings ?

DRMAA specifies drmaa_control(DRMAA_JOB_IDS_SESSION_ALL) as an
atomic call. If it fails, one of the DRMAA error codes is
returned to indicate the failure, and the call can then be
repeated. I don't see a reasonable way to improve the DRMAA spec
for that call, so I would argue against filing a tracker item.
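
To illustrate the "check the error code and repeat" pattern, here is a
minimal sketch. It assumes the call really is atomic, so that a failed
attempt leaves no job in a changed state. So that it compiles without a
DRMAA library, it uses a hypothetical stub, sketch_control(), which
mirrors the parameter shape of drmaa_control() (job id, action,
diagnosis buffer, buffer length). All SKETCH_* names, the failure
counter, and control_all_with_retry() are illustrative assumptions, not
part of DRMAA.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins; real code would take these from <drmaa.h>. */
#define SKETCH_ERRNO_SUCCESS   0
#define SKETCH_ERRNO_FAILURE   1
#define SKETCH_JOB_IDS_ALL     "DRMAA_JOB_IDS_SESSION_ALL"
#define SKETCH_CONTROL_SUSPEND 0

/* Number of simulated failures the stub will still report. */
static int sketch_failures_left = 0;

/* Stub with the same parameter shape as drmaa_control().
 * It fails while sketch_failures_left > 0, then succeeds. */
static int sketch_control(const char *jobid, int action,
                          char *diag, size_t diag_len)
{
    (void)jobid;
    (void)action;
    if (sketch_failures_left > 0) {
        sketch_failures_left--;
        snprintf(diag, diag_len, "simulated failure");
        return SKETCH_ERRNO_FAILURE;
    }
    if (diag_len > 0)
        diag[0] = '\0';
    return SKETCH_ERRNO_SUCCESS;
}

/* Apply one action to all session jobs, repeating the call on error.
 * Returns the number of attempts used, or -1 if every attempt failed. */
static int control_all_with_retry(int action, int max_attempts)
{
    char diag[256];
    int attempt;

    for (attempt = 1; attempt <= max_attempts; attempt++) {
        int rc = sketch_control(SKETCH_JOB_IDS_ALL, action,
                                diag, sizeof diag);
        if (rc == SKETCH_ERRNO_SUCCESS)
            return attempt;
        fprintf(stderr, "attempt %d failed: %s\n", attempt, diag);
    }
    return -1;
}
```

With a real library, sketch_control() would be replaced by
drmaa_control() and the SKETCH_* values by the constants from drmaa.h.
Blind repetition is only safe under the atomicity assumption: if an
implementation can fail after changing some of the jobs, the second
attempt would be applied to jobs that are already in the new state.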

Regards,
Andreas