[drmaa-wg] Questions
Daniel Templeton
Dan.Templeton at Sun.COM
Wed Mar 30 14:28:50 CST 2005
Rajic, Hrabri wrote:
>>-----Original Message-----
>>From: owner-drmaa-wg at ggf.org [mailto:owner-drmaa-wg at ggf.org] On Behalf Of
>>Daniel Templeton
>>Sent: Wednesday, March 30, 2005 3:34 AM
>>To: DRMAA Working Group
>>Subject: [drmaa-wg] Questions
>>
>>In working on a remote implementation of the Java binding, I have run
>>into a couple of interesting questions. What happens when, during a call
>>to drmaa_control(DRMAA_JOB_IDS_SESSION_ALL), the implementation
>>fails to perform the given action on more than one job, for different
>>reasons? For example, if I try to hold all jobs, but one job is already
>>in a hold state, three jobs work ok, and the DRM goes down before acting
>>on the last job, what is the return code?
>
>
> The routine return code would need to indicate a compound error (BTW, we
> do not have such an error code defined), and the detailed error message
> would need to describe what happened.
In other words, the spec completely fails to address this case.
Something to keep in mind for 1.1 or 2.0.
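
To make the compound-error idea concrete, here is a rough sketch of per-job failure reporting for the hold-all example above. Nothing in it comes from the spec: COMPOUND_ERROR, BulkResult, control_all and demo_hold are all made-up names, and demo_hold merely stands in for a per-job hold request to the DRM.

```python
from dataclasses import dataclass, field

# Hypothetical status names; the spec defines no compound error code.
SUCCESS = "SUCCESS"
COMPOUND_ERROR = "COMPOUND_ERROR"


@dataclass
class BulkResult:
    code: str
    failures: dict = field(default_factory=dict)  # job id -> failure reason


def control_all(job_ids, act_on_job):
    """Apply act_on_job to every job, recording each failure instead of stopping."""
    failures = {}
    for job_id in job_ids:
        try:
            act_on_job(job_id)
        except Exception as exc:
            failures[job_id] = str(exc)
    code = SUCCESS if not failures else COMPOUND_ERROR
    return BulkResult(code, failures)


def demo_hold(job_id):
    # Stand-in for a per-job hold request (assumption, not a DRMAA call).
    if job_id == "1":
        raise RuntimeError("job already on hold")
    if job_id == "5":
        raise ConnectionError("DRM unreachable")


result = control_all(["1", "2", "3", "4", "5"], demo_hold)
```

Here result.code reports that something went wrong, and result.failures says which jobs failed and why, which is the detail a single return code cannot carry.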
>>When doing a drmaa_control(DRMAA_JOB_IDS_SESSION_ALL), what is the
>>contract on failure, i.e. in what state will the jobs be left? In the
>>case of a job failure, does that mean that all jobs will be left in the
>>state that they were in before the call? If so, that's going to cause
>>serious implementation problems. If not, that's going to cause serious
>>usability problems.
>
>
> A transactional interface would be quite useful here ...
> If a routine exits/fails during the call there is no good recourse.
Exactly the point I'm making. Without transactions, it's hard to use.
With transactions, it's hard to implement.
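
A rough sketch of why the transactional variant is hard to implement, assuming the implementation can only act on one job at a time and has some inverse operation available. The names control_all_or_nothing, hold and release are hypothetical, not DRMAA calls.

```python
def control_all_or_nothing(job_ids, do, undo):
    """Try do(job) on every job; on the first failure, undo the jobs already changed.

    Returns (ok, stuck), where stuck lists the jobs whose undo also failed.
    """
    done = []
    for job_id in job_ids:
        try:
            do(job_id)
            done.append(job_id)
        except Exception:
            stuck = []
            # Roll back in reverse order; the rollback itself can fail.
            for changed in reversed(done):
                try:
                    undo(changed)
                except Exception:
                    stuck.append(changed)
            return False, stuck
    return True, []


held = set()  # toy stand-in for the DRM's job state


def hold(job_id):
    if job_id == "c":
        raise RuntimeError("DRM went down")
    held.add(job_id)


def release(job_id):
    held.discard(job_id)


ok, stuck = control_all_or_nothing(["a", "b", "c"], hold, release)
```

In this toy run the rollback succeeds, but if the DRM really is down, the undo calls fail as well and the caller ends up with a non-empty stuck list, so the all-or-nothing contract cannot be honored.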
> Job failure? Is this a separate question?
> One analogy would be teaching a university course. There would be
> students dropping the course, but the rest goes ahead. In case of
> absences things also go ahead, and when the students reappear the regime
> is known.
That's a typo. I meant operation failure.
>>What happens when a job ends after a thread has called
>>drmaa_synchronize(DRMAA_JOB_IDS_SESSION_ALL), but another thread "steals" the job exit
>>info with a call to drmaa_wait()? I would assume that the synchronize
>>thread should just assume that the job finished, even though its job
>>record is gone. That is what the SGE implementation does.
>
>
> Ha, races with job reaping info. The developers would need to be
> careful in multithreaded environments ... some guidelines would be
> necessary, but preferably outside of the normative docs.
The reason I bring it up is that this particular case is non-obvious.
It's clear that waiting for the same job twice is bad, but it's not so
clear what should happen when a thread is waiting for any or all jobs.
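
A small sketch of the behavior described above, where a missing job record is taken to mean the job finished. This is not the SGE code; the Session class and its methods are made up for illustration, and all_finished is a non-blocking check rather than a real blocking synchronize.

```python
import threading


class Session:
    """Toy session that tracks running jobs separately from reapable exit info."""

    def __init__(self, job_ids):
        self._lock = threading.Lock()
        self._running = set(job_ids)
        self._exit_info = {}

    def job_finished(self, job_id, status):
        # Called when the DRM reports that a job has ended.
        with self._lock:
            self._running.discard(job_id)
            self._exit_info[job_id] = status

    def wait(self, job_id):
        """Reap the exit info; returns None if it is absent (already reaped or still running)."""
        with self._lock:
            return self._exit_info.pop(job_id, None)

    def all_finished(self, job_ids):
        """A job counts as finished once it is no longer running, reaped or not."""
        with self._lock:
            return not (self._running & set(job_ids))


s = Session(["1", "2"])
s.job_finished("1", 0)
s.job_finished("2", 0)
stolen = s.wait("1")               # another thread "steals" job 1's exit info
done = s.all_finished(["1", "2"])  # still reports that every job finished
```

The point is that the synchronize side never looks at the exit records, so it does not matter who reaped them.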
Daniel
--
***************************************************
* Daniel Templeton ERGB01 x60220 *
* Staff Engineer, Sun N1 Grid Engine *
***************************************************
* "Roads? Where we're going we don't need roads." *
* -Dr. Emmett Brown *
* Back to the Future (1985) *
***************************************************