[drmaa-wg] Synchronizing Against Waited Jobs

Rajic, Hrabri hrabri.rajic at intel.com
Mon Oct 24 11:14:26 CDT 2005


There could be inconsistencies here.

Just imagine one implementation handles jobIDs from different earlier
sessions and the other does not.  Upon a crash the application could try
to submit a few more tasks and synchronize on the previous and current
session jobs, leading to different behavior.  Barring unusual scheduling,
it could be beneficial to ensure different DRMAA implementations produce
the same behavior.  The current spec does not say anything about
immediate returns.
Have to check if there is a global policy on that in the doc.

A similar situation applies if garbage collection happens, rendering
job_ids unrecognizable - this use case implies poor job management in
the application.

Hrabri
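
For illustration, the two candidate behaviors for a drmaa_synchronize
call that receives one unrecognized jobID can be sketched with a toy
in-memory model.  This is a minimal sketch, not DRMAA library code:
ToySession, its method names, and the string error codes are
hypothetical stand-ins, and "waiting" is simulated by recording the
job id rather than blocking.

```python
# Toy model only: none of these names come from a real DRMAA binding.
DRMAA_ERRNO_SUCCESS = "DRMAA_ERRNO_SUCCESS"
DRMAA_ERRNO_INVALID_JOB = "DRMAA_ERRNO_INVALID_JOB"


class ToySession:
    """Knows only the job ids it has not yet disposed of."""

    def __init__(self):
        self.known = set()   # job ids the library still recognizes
        self.waited = []     # job ids a synchronize call "blocked" on

    def submit(self, job_id):
        self.known.add(job_id)

    def _wait_for(self, job_id):
        # Stand-in for blocking until the job finishes.
        self.waited.append(job_id)

    def synchronize_validate_first(self, job_ids, dispose=False):
        # Reject the whole call before acting on any recognized job.
        if any(j not in self.known for j in job_ids):
            return DRMAA_ERRNO_INVALID_JOB
        for j in job_ids:
            self._wait_for(j)
            if dispose:
                self.known.discard(j)
        return DRMAA_ERRNO_SUCCESS

    def synchronize_wait_then_report(self, job_ids, dispose=False):
        # Wait on the legitimate jobs first, then report the bad id.
        invalid = False
        for j in job_ids:
            if j not in self.known:
                invalid = True
                continue
            self._wait_for(j)
            if dispose:
                self.known.discard(j)
        if invalid:
            return DRMAA_ERRNO_INVALID_JOB
        return DRMAA_ERRNO_SUCCESS
```

Both variants return the same error code for a list containing a
stale id; they differ in whether the caller was blocked on the valid
jobs before seeing it, which is the inconsistency at issue.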

>-----Original Message-----
>From: owner-drmaa-wg at ggf.org [mailto:owner-drmaa-wg at ggf.org] On Behalf Of
>Roger Brobst
>Sent: Monday, October 24, 2005 10:37 AM
>To: DRMAA Working Group
>Subject: RE: [drmaa-wg] Synchronizing Against Waited Jobs
>
>
>If any of the jobIDs provided to drmaa_synchronize
>are unrecognized, DRMAA_ERRNO_INVALID_JOB should be
>returned; no action should be taken upon any recognized
>jobIDs.
>
>-Roger
>
>
>In a previous e-mail, Rajic, Hrabri wrote:
>> Roger,
>>
>> Do you advocate DRMAA returns immediately with DRMAA_ERRNO_INVALID_JOB
>> or after it waited for the remaining legitimate jobs to finish?
>>
>> Hrabri
>>
>> >-----Original Message-----
>> >From: owner-drmaa-wg at ggf.org [mailto:owner-drmaa-wg at ggf.org] On Behalf Of
>> >Roger Brobst
>> >Sent: Monday, October 24, 2005 10:16 AM
>> >To: DRMAA Working Group
>> >Subject: Re: [drmaa-wg] Synchronizing Against Waited Jobs
>> >
>> >
>> >My opinion ...
>> >
>> >Because a DRMAA implementation is not required to
>> >retain information about jobs which have been reaped,
>> >drmaa_synchronize should not be required to
>> >distinguish between non-existent and reaped jobs.
>> >
>> >A drmaa_synchronize implementation should return
>> >DRMAA_ERRNO_INVALID_JOB if a provided jobID is
>> >unrecognized.
>> >
>> >If a drmaa_synchronize implementation successfully
>> >validates a jobID for a reaped job, it may
>> >return DRMAA_ERRNO_SUCCESS.
>> >
>> >-Roger
>> >
>> >
>> >In a previous e-mail, Daniel Templeton wrote:
>> >> So, the only person who hasn't weighed in is Roger.
>> >> Care to offer an opinion?
>> >>
>> >> Daniel
>> >>
>> >> Peter Troeger wrote On 10/21/05 10:21,:
>> >>
>> >> >I support the argumentation of Hrabri. DRMAA introduced "dispose=true"
>> >> >in the interface, so resource consumption seems to be an issue. If a job
>> >> >was subject to drmaa_wait(), and the data was disposed, nothing should
>> >> >be left in memory about this job. IMHO the job becomes completely
>> >> >unknown to the library after this point.
>> >> >
>> >> >BTW, this holds also for the current Condor DRMAA implementation. It is
>> >> >also reasoned by the behavior of the underlying Condor system. If a job
>> >> >was finished, only the log files can tell you what happened. The Condor
>> >> >DRMAA library uses such a log file for each job, and if you execute
>> >> >drmaa_wait(dispose=true), the log file and in-memory structures for the
>> >> >job are removed. Calling drmaa_synchronize() after this results in
>> >> >DRMAA_ERRNO_INVALID_JOB.
>> >> >
>> >> >Things might be clearer if we would have an explicit drmaa_dispose_job()
>> >> >function.
>> >> >
>> >> >Regards,
>> >> >Peter.
>> >> >
>> >> >Rajic, Hrabri wrote:
>> >> >
>> >> >>My wig is in dry cleaning.  Nevertheless, here is my short take on this.
>> >> >>
>> >> >>If an implementation has job_ids handy, it could conveniently make a
>> >> >>good determination of which jobs are invalid (do not exist) and throw
>> >> >>DRMAA_ERRNO_INVALID_JOB.  IMHO, it is not a big deal if the routine
>> >> >>gives imprecise diagnostics if it is forced to do memory garbage
>> >> >>collection earlier.  The term "quality of implementation" comes to mind,
>> >> >>but that quality could come at the expense of being a memory hog, which
>> >> >>in turn could lead to paging - quite dubious.
>> >> >>See, the implementations might differently handle jobs that did not come
>> >> >>from the current session, so we could not be precise here either.
>> >> >>
>> >> >>The important thing for the user is to synchronize, i.e. block the
>> >> >>program from continuing if there are running remote jobs.
>> >> >>
>> >> >>Dispose = true helps get rid of the rusage info to free DRMAA
>> >> >>implementations of heavy memory requirements when it matters, so keeping
>> >> >>all the past job_ids for providing precise exit errors runs contrary to
>> >> >>the goal of lessening memory requirements in the same routine.
>> >> >>
>> >> >>My 2 pfennigs,
>> >> >>
>> >> >>Hrabri
>> >> >>
>> >> >>>-----Original Message-----
>> >> >>>From: owner-drmaa-wg at ggf.org [mailto:owner-drmaa-wg at ggf.org] On Behalf Of
>> >> >>>Daniel Templeton
>> >> >>>Sent: Wednesday, October 19, 2005 11:37 AM
>> >> >>>To: DRMAA Working Group
>> >> >>>Subject: [drmaa-wg] Synchronizing Against Waited Jobs
>> >> >>>
>> >> >>>We have found a bug in the SGE DRMAA implementation, (I know! It's
>> >> >>>shocking!) but Andreas and I can't agree on what the fix should be.  The
>> >> >>>issue is that in the current implementation, synchronizing against jobs
>> >> >>>that did not come from the current session returns DRMAA_ERRNO_SUCCESS.
>> >> >>>The part about which we disagree is what should happen when
>> >> >>>synchronizing against jobs that are from the current session, but that
>> >> >>>have already ended and have already had drmaa_wait() (or
>> >> >>>drmaa_synchronize() with dispose=true) called against them.
>> >> >>>
>> >> >>>My stance is that one can extrapolate from the drmaa_wait() function
>> >> >>>that there is no difference between jobs which don't exist (at all or in
>> >> >>>the current session) and jobs whose exit information has been disposed
>> >> >>>(via drmaa_wait() or drmaa_synchronize()).  Therefore, calling
>> >> >>>drmaa_synchronize() on jobs which have already had drmaa_wait() called
>> >> >>>against them should return DRMAA_ERRNO_INVALID_JOB.
>> >> >>>
>> >> >>>Andreas holds that it can be inferred from the lack of the above
>> >> >>>statement in the spec, that drmaa_synchronize() handles such jobs
>> >> >>>differently from drmaa_wait().  Because drmaa_synchronize() does not
>> >> >>>need the jobs' exit information to succeed, it should be able to operate
>> >> >>>on jobs whose exit information has already been disposed.  Therefore,
>> >> >>>calling drmaa_synchronize() on jobs which have already had drmaa_wait()
>> >> >>>called against them should return DRMAA_ERRNO_SUCCESS.
>> >> >>>
>> >> >>>I can agree that Andreas' position makes theoretical sense, but I
>> >> >>>believe it runs contrary to the stated goal of minimizing the
>> >> >>>requirements on the implementing DRMS.  In order to implement a
>> >> >>>drmaa_synchronize() that can distinguish between jobs that have been
>> >> >>>disposed and jobs that never existed, the DRMAA implementation must keep
>> >> >>>a list of the ids of every job that has ever been submitted in the
>> >> >>>current session, and with every drmaa_synchronize() call, the list must
>> >> >>>be searched to validate the synchronize id list.  And for what?
>> >> >>>DRMAA_JOB_IDS_ALL covers every case I can think of where the behavior
>> >> >>>Andreas described would be useful.  To me, it sounds like a lot of extra
>> >> >>>work for the DRMAA implementation with no tangible benefit.
>> >> >>>
>> >> >>>On what Andreas and I can agree is that if we decide he is right, we
>> >> >>>will close the bug as "won't fix" because the fix will be worse than the
>> >> >>>bug.  In any case, we should probably have a tracker item to make the
>> >> >>>final decision explicit in the spec.
>> >> >>>
>> >> >>>What say you, oh, wise ones?
>> >> >>>
>> >> >>>Daniel
>> >> >>>
>> >> >>>--
>> >> >>>***************************************************
>> >> >>>*        Daniel Templeton   ERGB01 x60220         *
>> >> >>>*       Staff Engineer, Sun N1 Grid Engine        *
>> >> >>>***************************************************
>> >> >>>* "So let the sunshine in.  Face it with a grin.  *
>> >> >>>*  Smilers never lose, and frowners never win."   *
>> >> >>>*      -Let the Sunshine In, Pebbles Flintstone   *
>> >> >>>***************************************************
>> >> >>>
>> >>
>> >> --
>> >> ***************************************************
>> >> *        Daniel Templeton   ERGB01 x60220         *
>> >> *       Staff Engineer, Sun N1 Grid Engine        *
>> >> ***************************************************
>> >> * "So let the sunshine in.  Face it with a grin.  *
>> >> *  Smilers never lose, and frowners never win."   *
>> >> *      -Let the Sunshine In, Pebbles Flintstone   *
>> >> ***************************************************
>> >>
>>
>
>--
>Roger Brobst                            Cadence Design Systems
>Internet: rBrobst at cadence.com           555 River Oaks Parkway MS-1B1
>Voice: (408)894-3422                    San Jose, CA 95134
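
The bookkeeping cost raised in the original message (keeping the id of
every job ever submitted in order to tell reaped jobs from jobs that
never existed) can be made concrete with a toy model.  This is a
minimal sketch under stated assumptions: ForgetfulSession,
RememberingSession, tracked_ids, and the string error codes are
hypothetical names, not part of any DRMAA binding, and no real jobs
are run.

```python
# Toy model only: hypothetical names, not a real DRMAA binding.
DRMAA_ERRNO_SUCCESS = "DRMAA_ERRNO_SUCCESS"
DRMAA_ERRNO_INVALID_JOB = "DRMAA_ERRNO_INVALID_JOB"


class ForgetfulSession:
    """Drops every trace of a job once it has been waited on."""

    def __init__(self):
        self.active = set()

    def submit(self, job_id):
        self.active.add(job_id)

    def wait(self, job_id):
        if job_id not in self.active:
            return DRMAA_ERRNO_INVALID_JOB
        self.active.discard(job_id)  # reaped: nothing is retained
        return DRMAA_ERRNO_SUCCESS

    def recognizes(self, job_id):
        return job_id in self.active

    def synchronize(self, job_ids):
        if all(self.recognizes(j) for j in job_ids):
            return DRMAA_ERRNO_SUCCESS
        return DRMAA_ERRNO_INVALID_JOB

    def tracked_ids(self):
        return len(self.active)


class RememberingSession(ForgetfulSession):
    """Keeps reaped ids so synchronize can still accept them."""

    def __init__(self):
        super().__init__()
        self.reaped = set()  # grows with every job ever waited on

    def wait(self, job_id):
        rc = super().wait(job_id)
        if rc == DRMAA_ERRNO_SUCCESS:
            self.reaped.add(job_id)
        return rc

    def recognizes(self, job_id):
        return job_id in self.active or job_id in self.reaped

    def tracked_ids(self):
        return len(self.active) + len(self.reaped)
```

After submitting and waiting on N jobs, the forgetful variant tracks
no ids and reports a reaped job as invalid, while the remembering
variant accepts the reaped job but still holds all N ids.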




