[drmaa-wg] Synchronizing Against Waited Jobs

Roger Brobst rogerb at cadence.com
Mon Oct 24 10:36:31 CDT 2005


If any of the jobIDs provided to drmaa_synchronize
are unrecognized, DRMAA_ERRNO_INVALID_JOB should be
returned; no action should be taken upon any recognized
jobIDs.

-Roger
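
The all-or-nothing behavior described above can be sketched as follows. This is only a toy model, not a real DRMAA binding: `ToySession`, `Errno`, `submit`, `reap`, and the `waited` list are hypothetical names invented for illustration, and "waiting" is simulated by recording the job id rather than blocking.

```python
from enum import Enum


class Errno(Enum):
    # Hypothetical stand-ins for DRMAA_ERRNO_SUCCESS / DRMAA_ERRNO_INVALID_JOB.
    SUCCESS = 0
    INVALID_JOB = 1


class ToySession:
    """Hypothetical in-memory stand-in for a DRMAA session."""

    def __init__(self):
        self.known = {}   # job id -> "running" or "done"
        self.waited = []  # ids this model "blocked" on, in order

    def submit(self, job_id):
        self.known[job_id] = "running"

    def reap(self, job_id):
        # Models a job that was waited on and disposed: it is forgotten.
        self.known.pop(job_id, None)

    def synchronize(self, job_ids, dispose=False):
        # Validate every id first; touch no job if any id is unrecognized.
        if any(job_id not in self.known for job_id in job_ids):
            return Errno.INVALID_JOB
        # dict.fromkeys() drops duplicate ids while keeping their order.
        for job_id in dict.fromkeys(job_ids):
            self.waited.append(job_id)    # stand-in for blocking on the job
            self.known[job_id] = "done"
            if dispose:
                del self.known[job_id]
        return Errno.SUCCESS
```

In this model, synchronizing against one live job and one reaped job returns `Errno.INVALID_JOB` and leaves the live job untouched, because validation happens before any job is waited on or disposed.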


In a previous e-mail, Rajic, Hrabri wrote:
> Roger,
> 
> Do you advocate that DRMAA return immediately with
> DRMAA_ERRNO_INVALID_JOB, or only after it has waited for the remaining
> legitimate jobs to finish?
> 
> Hrabri
> 
> >-----Original Message-----
> >From: owner-drmaa-wg at ggf.org [mailto:owner-drmaa-wg at ggf.org]
> >On Behalf Of Roger Brobst
> >Sent: Monday, October 24, 2005 10:16 AM
> >To: DRMAA Working Group
> >Subject: Re: [drmaa-wg] Synchronizing Against Waited Jobs
> >
> >
> >My opinion ...
> >
> >Because a DRMAA implementation is not required to
> >retain information about jobs which have been reaped,
> >drmaa_synchronize should not be required to
> >distinguish between nonexistent and reaped jobs.
> >
> >A drmaa_synchronize implementation should return
> >DRMAA_ERRNO_INVALID_JOB if a provided jobID is
> >unrecognized.
> >
> >If a drmaa_synchronize implementation successfully
> >validates a jobID for a reaped job, it may
> >return DRMAA_ERRNO_SUCCESS.
> >
> >-Roger
> >
> >
> >In a previous e-mail, Daniel Templeton wrote:
> >> So, the only person who hasn't weighed in is Roger.
> >> Care to offer an opinion?
> >>
> >> Daniel
> >>
> >> Peter Troeger wrote on 10/21/05 10:21:
> >>
> >> >I support Hrabri's argument. DRMAA introduced "dispose=true" in the
> >> >interface, so resource consumption seems to be an issue. If a job
> >> >was subject to drmaa_wait() and its data was disposed, nothing should
> >> >be left in memory about this job. IMHO the job becomes completely
> >> >unknown to the library after this point.
> >> >
> >> >
> >> >BTW, this also holds for the current Condor DRMAA implementation,
> >> >and it follows from the behavior of the underlying Condor system.
> >> >Once a job has finished, only the log files can tell you what
> >> >happened. The Condor DRMAA library uses one such log file per job,
> >> >and if you execute drmaa_wait(dispose=true), the log file and
> >> >in-memory structures for the job are removed. Calling
> >> >drmaa_synchronize() after this results in DRMAA_ERRNO_INVALID_JOB.
> >> >
> >> >Things might be clearer if we had an explicit drmaa_dispose_job()
> >> >function.
> >> >
> >> >Regards,
> >> >Peter.
> >> >
> >> >
> >> >
> >> >Rajic, Hrabri wrote:
> >> >
> >> >
> >> >
> >> >>My wig is in dry cleaning.  Nevertheless, here is my short take on
> >> >>this.
> >> >>
> >> >>If an implementation has the job_ids handy, it can conveniently
> >> >>determine which jobs are invalid (do not exist) and throw
> >> >>DRMAA_ERRNO_INVALID_JOB.  IMHO, it is not a big deal if the routine
> >> >>gives imprecise diagnostics because it was forced to garbage-collect
> >> >>memory earlier.  "Quality of implementation" comes to mind, but that
> >> >>quality could come at the expense of being a memory hog, which in
> >> >>turn could lead to paging - quite dubious.
> >> >>Implementations might also handle jobs that did not come from the
> >> >>current session differently, so we could not be precise here either.
> >> >>
> >> >>The important thing for the user is to synchronize, i.e. to block the
> >> >>program from continuing while there are running remote jobs.
> >> >>
> >> >>Dispose = true helps get rid of the rusage info to free DRMAA
> >> >>implementations of heavy memory requirements when it matters, so
> >> >>keeping all the past job_ids in order to provide precise exit errors
> >> >>runs contrary to the goal of lessening memory requirements in the
> >> >>same routine.
> >> >>
> >> >>My 2 pfennigs,
> >> >>
> >> >>Hrabri
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>>-----Original Message-----
> >> >>>From: owner-drmaa-wg at ggf.org [mailto:owner-drmaa-wg at ggf.org]
> >> >>>On Behalf Of Daniel Templeton
> >> >>>Sent: Wednesday, October 19, 2005 11:37 AM
> >> >>>To: DRMAA Working Group
> >> >>>Subject: [drmaa-wg] Synchronizing Against Waited Jobs
> >> >>>
> >> >>>We have found a bug in the SGE DRMAA implementation, (I know! It's
> >> >>>shocking!) but Andreas and I can't agree on what the fix should be.
> >> >>>The issue is that in the current implementation, synchronizing
> >> >>>against jobs that did not come from the current session returns
> >> >>>DRMAA_ERRNO_SUCCESS.  The part about which we disagree is what
> >> >>>should happen when synchronizing against jobs that are from the
> >> >>>current session, but that have already ended and have already had
> >> >>>drmaa_wait() (or drmaa_synchronize() with dispose=true) called
> >> >>>against them.
> >> >>>
> >> >>>My stance is that one can extrapolate from the drmaa_wait() function
> >> >>>that there is no difference between jobs which don't exist (at all
> >> >>>or in the current session) and jobs whose exit information has been
> >> >>>disposed (via drmaa_wait() or drmaa_synchronize()).  Therefore,
> >> >>>calling drmaa_synchronize() on jobs which have already had
> >> >>>drmaa_wait() called against them should return
> >> >>>DRMAA_ERRNO_INVALID_JOB.
> >> >>>
> >> >>>Andreas holds that it can be inferred from the lack of the above
> >> >>>statement in the spec that drmaa_synchronize() handles such jobs
> >> >>>differently from drmaa_wait().  Because drmaa_synchronize() does not
> >> >>>need the jobs' exit information to succeed, it should be able to
> >> >>>operate on jobs whose exit information has already been disposed.
> >> >>>Therefore, calling drmaa_synchronize() on jobs which have already
> >> >>>had drmaa_wait() called against them should return
> >> >>>DRMAA_ERRNO_SUCCESS.
> >> >>>
> >> >>>I can agree that Andreas' position makes theoretical sense, but I
> >> >>>believe it runs contrary to the stated goal of minimizing the
> >> >>>requirements on the implementing DRMS.  In order to implement a
> >> >>>drmaa_synchronize() that can distinguish between jobs that have been
> >> >>>disposed and jobs that never existed, the DRMAA implementation must
> >> >>>keep a list of the ids of every job that has ever been submitted in
> >> >>>the current session, and with every drmaa_synchronize() call, the
> >> >>>list must be searched to validate the synchronize id list.  And for
> >> >>>what?  DRMAA_JOB_IDS_ALL covers every case I can think of where the
> >> >>>behavior Andreas described would be useful.  To me, it sounds like a
> >> >>>lot of extra work for the DRMAA implementation with no tangible
> >> >>>benefit.
> >> >>>
> >> >>>What Andreas and I can agree on is that if we decide he is right, we
> >> >>>will close the bug as "won't fix" because the fix will be worse than
> >> >>>the bug.  In any case, we should probably have a tracker item to
> >> >>>make the final decision explicit in the spec.
> >> >>>
> >> >>>What say you, oh, wise ones?
> >> >>>
> >> >>>Daniel
> >> >>>
> >> >>>--
> >> >>>***************************************************
> >> >>>*        Daniel Templeton   ERGB01 x60220         *
> >> >>>*       Staff Engineer, Sun N1 Grid Engine        *
> >> >>>***************************************************
> >> >>>* "So let the sunshine in.  Face it with a grin.  *
> >> >>>*  Smilers never lose, and frowners never win."   *
> >> >>>*      -Let the Sunshine In, Pebbles Flintstone   *
> >> >>>***************************************************
> 
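
The bookkeeping cost raised in the quoted thread (telling disposed jobs apart from jobs that never existed) can be illustrated with a small comparison. This is a sketch under stated assumptions, not how any DRMAA implementation is known to work: `ForgetfulTracker`, `RememberingTracker`, and their methods are hypothetical names, and job ids are plain strings.

```python
class ForgetfulTracker:
    """Hypothetical model: drops a job id as soon as the job is disposed."""

    def __init__(self):
        self.live = set()

    def submit(self, job_id):
        self.live.add(job_id)

    def dispose(self, job_id):
        self.live.discard(job_id)

    def classify(self, job_id):
        # Disposed and never-submitted ids are indistinguishable here.
        return "valid" if job_id in self.live else "invalid"

    def retained(self):
        return len(self.live)


class RememberingTracker(ForgetfulTracker):
    """Hypothetical model: keeps every id ever submitted in the session."""

    def __init__(self):
        super().__init__()
        self.disposed = set()

    def dispose(self, job_id):
        if job_id in self.live:
            self.live.remove(job_id)
            self.disposed.add(job_id)

    def classify(self, job_id):
        if job_id in self.live:
            return "valid"
        return "disposed" if job_id in self.disposed else "invalid"

    def retained(self):
        # Grows with every job submitted, even after disposal.
        return len(self.live) + len(self.disposed)
```

After submitting and disposing N jobs, the forgetful model retains no ids and reports every one of them as invalid, while the remembering model still holds all N ids in order to answer "disposed".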

-- 
Roger Brobst                            Cadence Design Systems
Internet: rBrobst at cadence.com           555 River Oaks Parkway MS-1B1
Voice: (408)894-3422                    San Jose, CA 95134




