[drmaa-wg] Synchronizing Against Waited Jobs

Rajic, Hrabri hrabri.rajic at intel.com
Mon Oct 24 13:32:36 CDT 2005


Looking at it from a different angle: if the routine does not return
right away, DRMAA_ERRNO_INVALID_JOB might not be reportable at all,
because it is more beneficial to return the synchronization-related
info than DRMAA_ERRNO_INVALID_JOB.  In that case the user has no way of
knowing whether the old or bogus job_ids are valid.

Hrabri
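
To make the point above concrete, here is a minimal sketch in Python (not
the C binding) of a routine that does not return right away.  Everything
in it is a toy model: the function synchronize_deferred, its arguments,
and the choice of DRMAA_ERRNO_EXIT_TIMEOUT as the example of
"synchronization related info" are assumptions for illustration, not
spec text.

```python
from enum import Enum


class Errno(Enum):
    # Names mirror DRMAA error codes; the values are just labels here.
    SUCCESS = "DRMAA_ERRNO_SUCCESS"
    INVALID_JOB = "DRMAA_ERRNO_INVALID_JOB"
    EXIT_TIMEOUT = "DRMAA_ERRNO_EXIT_TIMEOUT"


def synchronize_deferred(known_jobs, job_ids, timed_out):
    """Toy model: process every id first, report a single code at the end.

    known_jobs: set of ids the (hypothetical) library still recognizes.
    timed_out:  set of ids whose wait would hit the timeout.
    """
    saw_invalid = False
    saw_timeout = False
    for job_id in job_ids:
        if job_id not in known_jobs:
            saw_invalid = True  # remembered, but we keep going
            continue
        if job_id in timed_out:
            saw_timeout = True
    # Only one code fits in the return value; synchronization info wins,
    # so an invalid id can be masked by a timeout.
    if saw_timeout:
        return Errno.EXIT_TIMEOUT
    if saw_invalid:
        return Errno.INVALID_JOB
    return Errno.SUCCESS
```

With one bogus id and one job that times out, the caller only sees the
timeout and learns nothing about the bogus id.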

>-----Original Message-----
>From: owner-drmaa-wg at ggf.org [mailto:owner-drmaa-wg at ggf.org] On Behalf Of
>Rajic, Hrabri
>Sent: Monday, October 24, 2005 11:14 AM
>To: rbrobst at cadence.com; DRMAA Working Group
>Subject: RE: [drmaa-wg] Synchronizing Against Waited Jobs
>
>There could be inconsistencies here.
>
>Just imagine one implementation handles job IDs from earlier sessions
>and another does not.  After a crash, the application could try to
>submit a few more tasks and synchronize on the previous and current
>session jobs, leading to different behavior.  Barring unusual
>scheduling, it could be beneficial to ensure that different DRMAA
>implementations produce the same behavior.  The current spec does not
>say anything about immediate returns.
>I have to check whether there is a global policy on that in the doc.
>
>A similar situation applies if garbage collection happens, rendering
>job_ids unrecognizable; this use case implies poor job management in
>the application.
>
>Hrabri
>
>>-----Original Message-----
>>From: owner-drmaa-wg at ggf.org [mailto:owner-drmaa-wg at ggf.org] On Behalf Of
>>Roger Brobst
>>Sent: Monday, October 24, 2005 10:37 AM
>>To: DRMAA Working Group
>>Subject: RE: [drmaa-wg] Synchronizing Against Waited Jobs
>>
>>
>>If any of the jobIDs provided to drmaa_synchronize
>>are unrecognized, DRMAA_ERRNO_INVALID_JOB should be
>>returned; no action should be taken upon any recognized
>>jobIDs.
>>
>>-Roger
>>
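
The all-or-nothing rule proposed above can be sketched as a two-pass
routine: validate every id first, and only then act.  This is a Python
toy model, not the C binding; synchronize_strict and the dict used as
the session's job table are hypothetical names chosen for illustration.

```python
SUCCESS = "DRMAA_ERRNO_SUCCESS"
INVALID_JOB = "DRMAA_ERRNO_INVALID_JOB"


def synchronize_strict(session_jobs, job_ids, dispose=False):
    """Toy model of 'validate first, act only if every id is recognized'.

    session_jobs: dict mapping job id -> state, standing in for whatever
                  table a real implementation keeps (an assumption).
    """
    # Pass 1: validation only. Nothing is modified in this loop.
    for job_id in job_ids:
        if job_id not in session_jobs:
            return INVALID_JOB  # immediate return, recognized ids untouched

    # Pass 2: every id is known, so it is now safe to act on them.
    for job_id in job_ids:
        if job_id in session_jobs:
            session_jobs[job_id] = "done"  # stand-in for blocking until finished
        if dispose:
            session_jobs.pop(job_id, None)  # pop() tolerates duplicate ids
    return SUCCESS
```

A call that mixes one recognized id with one unrecognized id returns
DRMAA_ERRNO_INVALID_JOB and leaves the recognized job's entry exactly as
it was, even with dispose=True.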
>>
>>In a previous e-mail, Rajic, Hrabri wrote:
>>> Roger,
>>>
>>> Do you advocate that DRMAA return immediately with
>>> DRMAA_ERRNO_INVALID_JOB, or only after it has waited for the
>>> remaining legitimate jobs to finish?
>>>
>>> Hrabri
>>>
>>> >-----Original Message-----
>>> >From: owner-drmaa-wg at ggf.org [mailto:owner-drmaa-wg at ggf.org] On Behalf Of
>>> >Roger Brobst
>>> >Sent: Monday, October 24, 2005 10:16 AM
>>> >To: DRMAA Working Group
>>> >Subject: Re: [drmaa-wg] Synchronizing Against Waited Jobs
>>> >
>>> >
>>> >My opinion ...
>>> >
>>> >Because a DRMAA implementation is not required to
>>> >retain information about jobs which have been reaped,
>>> >drmaa_synchronize should not be required to
>>> >distinguish between non-existent and reaped jobs.
>>> >
>>> >A drmaa_synchronize implementation should return
>>> >DRMAA_ERRNO_INVALID_JOB if a provided jobID is
>>> >unrecognized.
>>> >
>>> >If a drmaa_synchronize implementation successfully
>>> >validates a jobID for a reaped job, it may
>>> >return DRMAA_ERRNO_SUCCESS.
>>> >
>>> >-Roger
>>> >
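
One way to picture the opinion above: an implementation may remember
some, all, or none of the ids it has reaped, and the result of
synchronizing on a reaped job follows from that choice.  The sketch
below is a Python toy model; ToySession and its remember_reaped
parameter are invented for illustration and do not come from the spec
or any real library.

```python
from collections import deque

SUCCESS = "DRMAA_ERRNO_SUCCESS"
INVALID_JOB = "DRMAA_ERRNO_INVALID_JOB"


class ToySession:
    def __init__(self, remember_reaped=0):
        self.active = set()
        # Bounded memory of reaped ids; with maxlen=0 nothing is retained.
        self.reaped = deque(maxlen=remember_reaped)

    def submit(self, job_id):
        self.active.add(job_id)

    def wait(self, job_id):
        # Reaping a job removes it from the active set.
        if job_id not in self.active:
            return INVALID_JOB
        self.active.remove(job_id)
        self.reaped.append(job_id)  # silently dropped when maxlen is 0
        return SUCCESS

    def synchronize(self, job_ids):
        for job_id in job_ids:
            if job_id not in self.active and job_id not in self.reaped:
                # Reaped-and-forgotten looks the same as never-existed.
                return INVALID_JOB
        return SUCCESS
```

With remember_reaped=0 a reaped job and a job that never existed both
yield DRMAA_ERRNO_INVALID_JOB; with a non-zero value the reaped job can
still validate and yield DRMAA_ERRNO_SUCCESS.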
>>> >
>>> >In a previous e-mail, Daniel Templeton wrote:
>>> >> So, the only person who hasn't weighed in is Roger.
>>> >> Care to offer an opinion?
>>> >>
>>> >> Daniel
>>> >>
>>> >> Peter Troeger wrote On 10/21/05 10:21,:
>>> >>
>>> >> >I support the argumentation of Hrabri. DRMAA introduced
>>> >> >"dispose=true" in the interface, so resource consumption seems to
>>> >> >be an issue. If a job was subject to drmaa_wait(), and the data
>>> >> >was disposed, nothing should be left in memory about this job.
>>> >> >IMHO the job becomes completely unknown to the library after this
>>> >> >point.
>>> >> >
>>> >> >
>>> >> >BTW, this also holds for the current Condor DRMAA implementation.
>>> >> >It also follows from the behavior of the underlying Condor
>>> >> >system. Once a job has finished, only the log files can tell you
>>> >> >what happened. The Condor DRMAA library uses such a log file for
>>> >> >each job, and if you execute drmaa_wait(dispose=true), the log
>>> >> >file and in-memory structures for the job are removed. Calling
>>> >> >drmaa_synchronize() after this results in
>>> >> >DRMAA_ERRNO_INVALID_JOB.
>>> >> >
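
The log-file behavior described above can be modelled roughly as
follows.  This is a Python toy, not the Condor DRMAA library: the class
LogBackedSession, the "<job_id>.log" file naming, and the log contents
are all assumptions made up for the sketch.

```python
import tempfile
from pathlib import Path

SUCCESS = "DRMAA_ERRNO_SUCCESS"
INVALID_JOB = "DRMAA_ERRNO_INVALID_JOB"


class LogBackedSession:
    """Toy model: a job is known only while its log file and entry exist."""

    def __init__(self, log_dir):
        self.log_dir = Path(log_dir)
        self.jobs = {}  # in-memory structures: job id -> log path

    def submit(self, job_id):
        log = self.log_dir / f"{job_id}.log"  # hypothetical naming scheme
        log.write_text("submitted\nterminated\n")  # pretend the job already ran
        self.jobs[job_id] = log

    def _known(self, job_id):
        log = self.jobs.get(job_id)
        return log is not None and log.exists()

    def wait(self, job_id, dispose=True):
        if not self._known(job_id):
            return INVALID_JOB
        if dispose:
            # Remove both the log file and the in-memory entry.
            self.jobs.pop(job_id).unlink()
        return SUCCESS

    def synchronize(self, job_ids):
        if all(self._known(job_id) for job_id in job_ids):
            return SUCCESS
        return INVALID_JOB


def demo():
    # Synchronize before and after a disposing wait on the same job.
    with tempfile.TemporaryDirectory() as log_dir:
        session = LogBackedSession(log_dir)
        session.submit("7")
        before = session.synchronize(["7"])
        session.wait("7", dispose=True)
        after = session.synchronize(["7"])
        return before, after
```

In this model, once the log file is gone there is nothing left from
which the library could tell a disposed job from one that never existed.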
>>> >> >Things might be clearer if we had an explicit
>>> >> >drmaa_dispose_job() function.
>>> >> >
>>> >> >Regards,
>>> >> >Peter.
>>> >> >
>>> >> >Rajic, Hrabri wrote:
>>> >> >
>>> >> >>My wig is in dry cleaning.  Nevertheless, here is my short take
>>> >> >>on this.
>>> >> >>
>>> >> >>If an implementation has the job_ids handy, it could conveniently
>>> >> >>make a good determination of which jobs are invalid (do not
>>> >> >>exist) and throw DRMAA_ERRNO_INVALID_JOB.  IMHO, it is not a big
>>> >> >>deal if the routine gives imprecise diagnostics because it was
>>> >> >>forced to do memory garbage collection earlier.  The term
>>> >> >>"quality of implementation" comes to mind, but that quality could
>>> >> >>come at the expense of being a memory hog, which in turn could
>>> >> >>lead to paging - quite dubious.
>>> >> >>See, the implementations might handle jobs that did not come
>>> >> >>from the current session differently, so we could not be precise
>>> >> >>here either.
>>> >> >>
>>> >> >>The important thing for the user is to synchronize, i.e. to block
>>> >> >>the program from continuing if there are running remote jobs.
>>> >> >>
>>> >> >>Dispose = true helps get rid of the rusage info to free DRMAA
>>> >> >>implementations of heavy memory requirements when it matters, so
>>> >> >>keeping all the past job_ids for providing precise exit errors
>>> >> >>runs contrary to the goal of lessening memory requirements in the
>>> >> >>same routine.
>>> >> >>
>>> >> >>My 2 pfennigs,
>>> >> >>
>>> >> >>Hrabri
>>> >> >>
>>> >> >>>-----Original Message-----
>>> >> >>>From: owner-drmaa-wg at ggf.org [mailto:owner-drmaa-wg at ggf.org] On Behalf Of
>>> >> >>>Daniel Templeton
>>> >> >>>Sent: Wednesday, October 19, 2005 11:37 AM
>>> >> >>>To: DRMAA Working Group
>>> >> >>>Subject: [drmaa-wg] Synchronizing Against Waited Jobs
>>> >> >>>
>>> >> >>>We have found a bug in the SGE DRMAA implementation, (I know!
>>> >> >>>It's shocking!) but Andreas and I can't agree on what the fix
>>> >> >>>should be.  The issue is that in the current implementation,
>>> >> >>>synchronizing against jobs that did not come from the current
>>> >> >>>session returns DRMAA_ERRNO_SUCCESS.  The part about which we
>>> >> >>>disagree is what should happen when synchronizing against jobs
>>> >> >>>that are from the current session, but that have already ended
>>> >> >>>and have already had drmaa_wait() (or drmaa_synchronize() with
>>> >> >>>dispose=true) called against them.
>>> >> >>>
>>> >> >>>My stance is that one can extrapolate from the drmaa_wait()
>>> >> >>>function that there is no difference between jobs which don't
>>> >> >>>exist (at all or in the current session) and jobs whose exit
>>> >> >>>information has been disposed (via drmaa_wait() or
>>> >> >>>drmaa_synchronize()).  Therefore, calling drmaa_synchronize() on
>>> >> >>>jobs which have already had drmaa_wait() called against them
>>> >> >>>should return DRMAA_ERRNO_INVALID_JOB.
>>> >> >>>
>>> >> >>>Andreas holds that it can be inferred from the lack of the above
>>> >> >>>statement in the spec, that drmaa_synchronize() handles such jobs
>>> >> >>>differently from drmaa_wait().  Because drmaa_synchronize() does
>>> >> >>>not need the jobs' exit information to succeed, it should be able
>>> >> >>>to operate on jobs whose exit information has already been
>>> >> >>>disposed.  Therefore, calling drmaa_synchronize() on jobs which
>>> >> >>>have already had drmaa_wait() called against them should return
>>> >> >>>DRMAA_ERRNO_SUCCESS.
>>> >> >>>
>>> >> >>>I can agree that Andreas' position makes theoretical sense, but I
>>> >> >>>believe it runs contrary to the stated goal of minimizing the
>>> >> >>>requirements on the implementing DRMS.  In order to implement a
>>> >> >>>drmaa_synchronize() that can distinguish between jobs that have
>>> >> >>>been disposed and jobs that never existed, the DRMAA
>>> >> >>>implementation must keep a list of the ids of every job that has
>>> >> >>>ever been submitted in the current session, and with every
>>> >> >>>drmaa_synchronize() call, the list must be searched to validate
>>> >> >>>the synchronize id list.  And for what?  DRMAA_JOB_IDS_ALL covers
>>> >> >>>every case I can think of where the behavior Andreas described
>>> >> >>>would be useful.  To me, it sounds like a lot of extra work for
>>> >> >>>the DRMAA implementation with no tangible benefit.
>>> >> >>>
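
The bookkeeping cost argued above can be sketched with a Python toy
model.  HistorySession, its ever_submitted set, and the JOB_IDS_ALL
sentinel (named after the constant as written in the message) are
assumptions for illustration, not the SGE implementation.

```python
SUCCESS = "DRMAA_ERRNO_SUCCESS"
INVALID_JOB = "DRMAA_ERRNO_INVALID_JOB"
JOB_IDS_ALL = "DRMAA_JOB_IDS_ALL"  # placeholder sentinel, spelled as in the message


class HistorySession:
    """Toy model that can tell 'disposed' apart from 'never existed'."""

    def __init__(self):
        self.active = set()
        self.ever_submitted = set()  # never shrinks during the session

    def submit(self, job_id):
        self.active.add(job_id)
        self.ever_submitted.add(job_id)

    def wait(self, job_id):
        # Reaping frees the active entry, but the history entry stays.
        self.active.discard(job_id)

    def synchronize(self, job_ids):
        if job_ids == [JOB_IDS_ALL]:
            # Only the still-active jobs matter; no history lookup needed.
            return SUCCESS
        for job_id in job_ids:
            if job_id not in self.ever_submitted:  # searched on every call
                return INVALID_JOB
        return SUCCESS
```

After a thousand submit/wait cycles the active set is empty while the
history still holds a thousand ids, kept only so that an explicit id
list can be validated; the JOB_IDS_ALL path never touches it.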
>>> >> >>>What Andreas and I can agree on is that if we decide he is right,
>>> >> >>>we will close the bug as "won't fix" because the fix will be
>>> >> >>>worse than the bug.  In any case, we should probably have a
>>> >> >>>tracker item to make the final decision explicit in the spec.
>>> >> >>>
>>> >> >>>What say you, oh, wise ones?
>>> >> >>>
>>> >> >>>Daniel
>>> >> >>>
>>> >> >>>--
>>> >> >>>***************************************************
>>> >> >>>*        Daniel Templeton   ERGB01 x60220         *
>>> >> >>>*       Staff Engineer, Sun N1 Grid Engine        *
>>> >> >>>***************************************************
>>> >> >>>* "So let the sunshine in.  Face it with a grin.  *
>>> >> >>>*  Smilers never lose, and frowners never win."   *
>>> >> >>>*      -Let the Sunshine In, Pebbles Flintstone   *
>>> >> >>>***************************************************
>>> >> >>>
>>> >>
>>> >>
>>>
>>
>>--
>>Roger Brobst                            Cadence Design Systems
>>Internet: rBrobst at cadence.com           555 River Oaks Parkway MS-1B1
>>Voice: (408)894-3422                    San Jose, CA 95134




