[drmaa-wg] Synchronizing Against Waited Jobs

Rajic, Hrabri hrabri.rajic at intel.com
Mon Oct 24 10:26:13 CDT 2005


Roger,

Do you advocate that DRMAA return DRMAA_ERRNO_INVALID_JOB immediately,
or only after it has waited for the remaining legitimate jobs to finish?

Hrabri
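
The two alternatives in the question can be put side by side in a toy
model. This is only a sketch: synchronize(), known_jobs and fail_fast are
hypothetical names that appear in no DRMAA binding, and the blocking wait
is replaced by a list of the ids that would have been waited on.

```python
# Hypothetical model; none of these names come from the DRMAA spec.
INVALID_JOB = "DRMAA_ERRNO_INVALID_JOB"
SUCCESS = "DRMAA_ERRNO_SUCCESS"


def synchronize(known_jobs, job_ids, fail_fast):
    """Return (status, waited_for) for a toy synchronize call.

    known_jobs: set of job ids the library still tracks.
    fail_fast:  True  -> validate every id before waiting on anything;
                False -> wait on the recognized ids first, then report.
    """
    unknown = [j for j in job_ids if j not in known_jobs]
    if unknown and fail_fast:
        return INVALID_JOB, []  # nothing was waited for
    # Stand-in for blocking on each recognized job.
    waited = [j for j in job_ids if j in known_jobs]
    return (INVALID_JOB if unknown else SUCCESS), waited
```

With known_jobs = {"1", "2"} and job_ids = ["1", "9", "2"], both variants
report INVALID_JOB; they differ only in whether jobs "1" and "2" have
finished by the time the call returns.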

>-----Original Message-----
>From: owner-drmaa-wg at ggf.org [mailto:owner-drmaa-wg at ggf.org] On Behalf Of
>Roger Brobst
>Sent: Monday, October 24, 2005 10:16 AM
>To: DRMAA Working Group
>Subject: Re: [drmaa-wg] Synchronizing Against Waited Jobs
>
>
>My opinion ...
>
>Because a DRMAA implementation is not required to
>retain information about jobs which have been reaped,
>drmaa_synchronize should not be required to
>distinguish between nonexistent and reaped jobs.
>
>A drmaa_synchronize implementation should return
>DRMAA_ERRNO_INVALID_JOB if a provided jobID is
>unrecognized.
>
>If a drmaa_synchronize implementation successfully
>validates a jobID for a reaped job, it may
>return DRMAA_ERRNO_SUCCESS.
>
>-Roger
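
One possible reading of the rule above, written as a table of permitted
return codes per job state. The function names and the three state labels
are hypothetical, and the "active" row assumes the call blocks without a
timeout until the job ends; neither is taken from the spec.

```python
# Hypothetical encoding of the proposed rule; not part of any DRMAA binding.
SUCCESS = "DRMAA_ERRNO_SUCCESS"
INVALID_JOB = "DRMAA_ERRNO_INVALID_JOB"


def allowed_results(job_state):
    """Return the set of codes drmaa_synchronize may give for one job id.

    job_state is one of:
      "active"  - submitted in this session and not yet reaped
      "reaped"  - already waited on, exit information disposed
      "unknown" - never seen by this session
    """
    if job_state == "active":
        return {SUCCESS}               # assumption: no timeout involved
    if job_state == "reaped":
        return {SUCCESS, INVALID_JOB}  # implementation may or may not remember it
    if job_state == "unknown":
        return {INVALID_JOB}
    raise ValueError(f"unexpected job state: {job_state!r}")


def conforms(job_state, result):
    """True if an observed return code is permitted under this reading."""
    return result in allowed_results(job_state)
```

Under this reading, an implementation that forgets reaped jobs and one
that remembers them would both conform.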
>
>
>In a previous e-mail, Daniel Templeton wrote:
>> So, the only person who hasn't weighed in is Roger.
>> Care to offer an opinion?
>>
>> Daniel
>>
>> Peter Troeger wrote On 10/21/05 10:21,:
>>
>> >I support the argumentation of Hrabri. DRMAA introduced "dispose=true"
>> >in the interface, so resource consumption seems to be an issue. If a job
>> >was subject to drmaa_wait(), and the data was disposed, nothing should
>> >be left in memory about this job. IMHO the job becomes completely
>> >unknown to the library after this point.
>> >
>> >
>> >BTW, this also holds for the current Condor DRMAA implementation. It
>> >also follows from the behavior of the underlying Condor system. Once a
>> >job has finished, only the log files can tell you what happened. The
>> >Condor DRMAA library uses such a log file for each job, and if you execute
>> >drmaa_wait(dispose=true), the log file and in-memory structures for the
>> >job are removed. Calling drmaa_synchronize() after this results in
>> >DRMAA_ERRNO_INVALID_JOB.
>> >
>> >Things might be clearer if we had an explicit drmaa_dispose_job()
>> >function.
>> >
>> >Regards,
>> >Peter.
>> >
>> >Rajic, Hrabri wrote:
>> >
>> >>My wig is in dry cleaning.  Nevertheless, here is my short take on this.
>> >>
>> >>If an implementation has the job_ids handy, it could conveniently
>> >>determine which jobs are invalid (do not exist) and return
>> >>DRMAA_ERRNO_INVALID_JOB.  IMHO, it is not a big deal if the routine
>> >>gives imprecise diagnostics because it was forced to garbage-collect
>> >>memory earlier.  The term "quality of implementation" comes to mind, but
>> >>that quality could come at the expense of being a memory hog, which in
>> >>turn could lead to paging - quite dubious.
>> >>Also, implementations might handle jobs that did not come from the
>> >>current session differently, so we could not be precise here either.
>> >>
>> >>The important thing for the user is to synchronize, i.e. block the program
>> >>from continuing if there are running remote jobs.
>> >>
>> >>Dispose = true helps get rid of the rusage info to free DRMAA
>> >>implementations of heavy memory requirements when it matters, so keeping
>> >>all the past job_ids for providing precise exit errors runs contrary to
>> >>the goal of lessening memory requirements in the same routine.
>> >>
>> >>My 2 pfennigs,
>> >>
>> >>Hrabri
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>>-----Original Message-----
>> >>>From: owner-drmaa-wg at ggf.org [mailto:owner-drmaa-wg at ggf.org] On Behalf Of
>> >>>Daniel Templeton
>> >>>Sent: Wednesday, October 19, 2005 11:37 AM
>> >>>To: DRMAA Working Group
>> >>>Subject: [drmaa-wg] Synchronizing Against Waited Jobs
>> >>>
>> >>>We have found a bug in the SGE DRMAA implementation, (I know! It's
>> >>>shocking!) but Andreas and I can't agree on what the fix should be.  The
>> >>>issue is that in the current implementation, synchronizing against jobs
>> >>>that did not come from the current session returns DRMAA_ERRNO_SUCCESS.
>> >>>The part about which we disagree is what should happen when
>> >>>synchronizing against jobs that are from the current session, but that
>> >>>have already ended and have already had drmaa_wait() (or
>> >>>drmaa_synchronize() with dispose=true) called against them.
>> >>>
>> >>>My stance is that one can extrapolate from the drmaa_wait() function
>> >>>that there is no difference between jobs which don't exist (at all or in
>> >>>the current session) and jobs whose exit information has been disposed
>> >>>(via drmaa_wait() or drmaa_synchronize()).  Therefore, calling
>> >>>drmaa_synchronize() on jobs which have already had drmaa_wait() called
>> >>>against them should return DRMAA_ERRNO_INVALID_JOB.
>> >>>
>> >>>Andreas holds that it can be inferred from the lack of the above
>> >>>statement in the spec, that drmaa_synchronize() handles such jobs
>> >>>differently from drmaa_wait().  Because drmaa_synchronize() does not
>> >>>need the jobs' exit information to succeed, it should be able to operate
>> >>>on jobs whose exit information has already been disposed.  Therefore,
>> >>>calling drmaa_synchronize() on jobs which have already had drmaa_wait()
>> >>>called against them should return DRMAA_ERRNO_SUCCESS.
>> >>>
>> >>>I can agree that Andreas' position makes theoretical sense, but I
>> >>>believe it runs contrary to the stated goal of minimizing the
>> >>>requirements on the implementing DRMS.  In order to implement a
>> >>>drmaa_synchronize() that can distinguish between jobs that have been
>> >>>disposed and jobs that never existed, the DRMAA implementation must keep
>> >>>a list of the ids of every job that has ever been submitted in the
>> >>>current session, and with every drmaa_synchronize() call, the list must
>> >>>be searched to validate the synchronize id list.  And for what?
>> >>>DRMAA_JOB_IDS_ALL covers every case I can think of where the behavior
>> >>>Andreas described would be useful. To me, it sounds like a lot of extra
>> >>>work for the DRMAA implementation with no tangible benefit.
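
The bookkeeping cost described above can be made concrete with a small
model. Everything here is a hypothetical sketch: Session, keep_history and
run() are invented for illustration, job completion is not simulated, and
the string "DRMAA_JOB_IDS_ALL" simply stands in for the constant named in
this message.

```python
# Hypothetical model of the two bookkeeping strategies; not a DRMAA binding.
SUCCESS = "DRMAA_ERRNO_SUCCESS"
INVALID_JOB = "DRMAA_ERRNO_INVALID_JOB"
JOB_IDS_ALL = "DRMAA_JOB_IDS_ALL"  # placeholder for the "all jobs" constant


class Session:
    def __init__(self, keep_history):
        self.active = set()  # jobs submitted and not yet reaped
        # Optional record of every id ever submitted in this session.
        self.history = set() if keep_history else None

    def submit(self, job_id):
        self.active.add(job_id)
        if self.history is not None:
            self.history.add(job_id)

    def wait(self, job_id):
        # Reaping drops the job from the active set; history is never pruned.
        self.active.discard(job_id)

    def synchronize(self, job_ids):
        if JOB_IDS_ALL in job_ids:
            return SUCCESS  # needs only the active set, never the history
        known = self.active if self.history is None else self.history
        return SUCCESS if all(j in known for j in job_ids) else INVALID_JOB

    def ids_held(self):
        # history is a superset of active whenever it is kept
        return len(self.active) if self.history is None else len(self.history)


def run(keep_history, n=1000):
    """Submit and reap n jobs, then probe a reaped id and the ALL constant."""
    s = Session(keep_history)
    for i in range(n):
        s.submit(str(i))
        s.wait(str(i))
    return s.ids_held(), s.synchronize(["0"]), s.synchronize([JOB_IDS_ALL])
```

In this model the history-keeping session holds one id per job ever
submitted, while the lean session holds none once every job is reaped, and
synchronizing against the "all jobs" constant succeeds in both.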
>> >>>
>> >>>What Andreas and I can agree on is that if we decide he is right, we
>> >>>will close the bug as "won't fix" because the fix will be worse than the
>> >>>bug.  In any case, we should probably have a tracker item to make the
>> >>>final decision explicit in the spec.
>> >>>
>> >>>What say you, oh, wise ones?
>> >>>
>> >>>Daniel
>> >>>
>> >>>--
>> >>>***************************************************
>> >>>*        Daniel Templeton   ERGB01 x60220         *
>> >>>*       Staff Engineer, Sun N1 Grid Engine        *
>> >>>***************************************************
>> >>>* "So let the sunshine in.  Face it with a grin.  *
>> >>>*  Smilers never lose, and frowners never win."   *
>> >>>*      -Let the Sunshine In, Pebbles Flintstone   *
>> >>>***************************************************
>> >>>
>>
>>




