[drmaa-wg] Synchronizing Against Waited Jobs

Roger Brobst rogerb at cadence.com
Mon Oct 24 10:15:38 CDT 2005


My opinion ...

Because a DRMAA implementation is not required to
retain information about jobs which have been reaped,
drmaa_synchronize should not be required to
distinguish between non-existent and reaped jobs.

A drmaa_synchronize implementation should return
DRMAA_ERRNO_INVALID_JOB if a provided jobID is
unrecognized.

If a drmaa_synchronize implementation successfully
validates a jobID for a reaped job, it may
return DRMAA_ERRNO_SUCCESS.
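
A minimal sketch of the rule above, as a toy in-memory model rather than
the real DRMAA API. The names ToySession, Errno and remember_reaped are
hypothetical and exist only for illustration; the flag stands for the
implementation's freedom to keep or drop the ids of reaped jobs.

```python
from enum import Enum


class Errno(Enum):
    """Stand-ins for the two DRMAA return codes discussed in this thread."""
    SUCCESS = "DRMAA_ERRNO_SUCCESS"
    INVALID_JOB = "DRMAA_ERRNO_INVALID_JOB"


class ToySession:
    """Hypothetical in-memory model of a DRMAA session (not the real API)."""

    def __init__(self, remember_reaped):
        self.remember_reaped = remember_reaped  # the implementation's choice
        self.live = set()    # submitted, not yet reaped
        self.reaped = set()  # only filled when remember_reaped is True

    def submit(self, job_id):
        self.live.add(job_id)

    def wait(self, job_id):
        """Model drmaa_wait(): reap the job and dispose of its exit info."""
        if job_id not in self.live:
            return Errno.INVALID_JOB
        self.live.remove(job_id)
        if self.remember_reaped:
            self.reaped.add(job_id)
        return Errno.SUCCESS

    def synchronize(self, job_ids):
        """Model drmaa_synchronize() under the proposed rule."""
        for job_id in job_ids:
            if job_id not in self.live and job_id not in self.reaped:
                # Unrecognized: never submitted, or reaped and forgotten.
                return Errno.INVALID_JOB
        # A real implementation would block here until live jobs finish.
        return Errno.SUCCESS
```

With remember_reaped=False, synchronizing against a job that has already
been waited on yields DRMAA_ERRNO_INVALID_JOB; with remember_reaped=True
it yields DRMAA_ERRNO_SUCCESS. A caller cannot rely on either outcome,
which is the point of not requiring the distinction.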

-Roger


In a previous e-mail, Daniel Templeton wrote:
> So, the only person who hasn't weighed in is Roger.
> Care to offer an opinion?
> 
> Daniel
> 
> Peter Troeger wrote On 10/21/05 10:21,:
> 
> >I support Hrabri's argument. DRMAA introduced "dispose=true"
> >in the interface, so resource consumption seems to be an issue. If a job
> >was subject to drmaa_wait(), and the data was disposed, nothing should
> >be left in memory about this job. IMHO the job becomes completely
> >unknown to the library after this point.
> >
> >
> >BTW, this also holds for the current Condor DRMAA implementation. It
> >also follows from the behavior of the underlying Condor system. Once a
> >job has finished, only the log files can tell you what happened. The Condor
> >DRMAA library uses such a log file for each job, and if you execute
> >drmaa_wait(dispose=true), the log file and in-memory structures for the
> >job are removed. Calling drmaa_synchronize() after this results in
> >DRMAA_ERRNO_INVALID_JOB.
> >
> >Things might be clearer if we had an explicit drmaa_dispose_job()
> >function.
> >
> >Regards,
> >Peter.
> >
> >
> >
> >Rajic, Hrabri wrote:
> >
> >>My wig is in dry cleaning.  Nevertheless, here is my short take on this.
> >>
> >>If an implementation has the job_ids handy, it could conveniently make a
> >>good determination of which jobs are invalid (do not exist) and throw
> >>DRMAA_ERRNO_INVALID_JOB.   IMHO, it is not a big deal if the routine
> >>gives imprecise diagnostics if it is forced to do memory garbage
> >>collection earlier.  The term "quality of implementation" comes to mind,
> >>but that quality could come at the expense of being a memory hog, which
> >>in turn could lead to paging - quite dubious.
> >>See, the implementations might differently handle jobs that did not come
> >>from the current session, so we could not be precise here either.
> >>The important thing for the user is to synchronize, i.e. block the program
> >>from continuing if there are running remote jobs.
> >>
> >>Dispose = true helps get rid of the rusage info to free DRMAA
> >>implementations of heavy memory requirements when it matters, so keeping
> >>all the past job_ids for providing precise exit errors runs contrary to
> >>the goal of lessening memory requirements in the same routine.
> >>
> >>My 2 pfennigs,
> >>
> >>Hrabri
> >>
> >>>-----Original Message-----
> >>>From: owner-drmaa-wg at ggf.org [mailto:owner-drmaa-wg at ggf.org] On Behalf Of
> >>>Daniel Templeton
> >>>Sent: Wednesday, October 19, 2005 11:37 AM
> >>>To: DRMAA Working Group
> >>>Subject: [drmaa-wg] Synchronizing Against Waited Jobs
> >>>
> >>>We have found a bug in the SGE DRMAA implementation, (I know! It's
> >>>shocking!) but Andreas and I can't agree on what the fix should be.  The
> >>>issue is that in the current implementation, synchronizing against jobs
> >>>that did not come from the current session returns DRMAA_ERRNO_SUCCESS.
> >>>The part about which we disagree is what should happen when
> >>>synchronizing against jobs that are from the current session, but that
> >>>have already ended and have already had drmaa_wait() (or
> >>>drmaa_synchronize() with dispose=true) called against them.
> >>>
> >>>My stance is that one can extrapolate from the drmaa_wait() function
> >>>that there is no difference between jobs which don't exist (at all or in
> >>>the current session) and jobs whose exit information has been disposed
> >>>(via drmaa_wait() or drmaa_synchronize()).  Therefore, calling
> >>>drmaa_synchronize() on jobs which have already had drmaa_wait() called
> >>>against them should return DRMAA_ERRNO_INVALID_JOB.
> >>>
> >>>Andreas holds that it can be inferred from the lack of the above
> >>>statement in the spec, that drmaa_synchronize() handles such jobs
> >>>differently from drmaa_wait().  Because drmaa_synchronize() does not
> >>>need the jobs' exit information to succeed, it should be able to operate
> >>>on jobs whose exit information has already been disposed.  Therefore,
> >>>calling drmaa_synchronize() on jobs which have already had drmaa_wait()
> >>>called against them should return DRMAA_ERRNO_SUCCESS.
> >>>
> >>>I can agree that Andreas' position makes theoretical sense, but I
> >>>believe it runs contrary to the stated goal of minimizing the
> >>>requirements on the implementing DRMS.  In order to implement a
> >>>drmaa_synchronize() that can distinguish between jobs that have been
> >>>disposed and jobs that never existed, the DRMAA implementation must keep
> >>>a list of the ids of every job that has ever been submitted in the
> >>>current session, and with every drmaa_synchronize() call, the list must
> >>>be searched to validate the synchronize id list.  And for what?
> >>>DRMAA_JOB_IDS_ALL covers every case I can think of where the behavior
> >>>Andreas described would be useful. To me, it sounds like a lot of extra
> >>>work for the DRMAA implementation with no tangible benefit.
> >>>
> >>>What Andreas and I can agree on is that if we decide he is right, we
> >>>will close the bug as "won't fix" because the fix will be worse than the
> >>>bug.  In any case, we should probably have a tracker item to make the
> >>>final decision explicit in the spec.
> >>>
> >>>What say you, oh, wise ones?
> >>>
> >>>Daniel
> >>>
> >>>--
> >>>***************************************************
> >>>*        Daniel Templeton   ERGB01 x60220         *
> >>>*       Staff Engineer, Sun N1 Grid Engine        *
> >>>***************************************************
> >>>* "So let the sunshine in.  Face it with a grin.  *
> >>>*  Smilers never lose, and frowners never win."   *
> >>>*      -Let the Sunshine In, Pebbles Flintstone   *
> >>>***************************************************
> 
> 
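
Daniel's point in the quoted thread, that the "all jobs" wildcard already
covers the useful cases without any history of reaped ids, can be sketched
with another toy model. ForgetfulSession and JOB_IDS_ALL are hypothetical
names used only here; the token is written as in Daniel's message (the C
binding, as far as I recall, spells it DRMAA_JOB_IDS_SESSION_ALL), and
KeyError merely stands in for DRMAA_ERRNO_INVALID_JOB.

```python
JOB_IDS_ALL = "DRMAA_JOB_IDS_ALL"  # placeholder token, named as in the thread


class ForgetfulSession:
    """Hypothetical session that keeps no record of reaped jobs."""

    def __init__(self):
        self.live = set()  # the only bookkeeping: jobs not yet reaped

    def submit(self, job_id):
        self.live.add(job_id)

    def wait(self, job_id):
        """Reap one job; afterwards the session no longer knows its id."""
        if job_id not in self.live:
            raise KeyError(job_id)  # stands in for DRMAA_ERRNO_INVALID_JOB
        self.live.remove(job_id)

    def synchronize(self, job_ids, dispose=True):
        """Return the ids synchronized against; raise on an unknown id."""
        if JOB_IDS_ALL in job_ids:
            targets = set(self.live)  # expands to whatever is still tracked
        else:
            targets = set(job_ids)
            unknown = targets - self.live
            if unknown:
                raise KeyError(sorted(unknown))
        # A real implementation would block until every target finishes.
        if dispose:
            self.live -= targets
        return targets
```

A caller that has already waited on some jobs gets an error when it lists
them explicitly, but can pass the wildcard instead and block on everything
still outstanding. The session then needs memory only for jobs that have
not yet been reaped.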




