[saga-rg] Job States

Thu Aug 4 02:03:48 CDT 2005

Quoting [Christopher Smith] (Aug 04 2005):
> 
> On 29/7/05 10:40, "Andre Merzky" <andre at merzky.net> wrote:
> 
> > SAGA jobs currently have the following states:
> > 
> [chop]
> > 
> > I got the comment from colleagues that PreStaging and
> > PostStaging are missing.  Indeed, these stages do not seem to
> > fit into any of the above ones.  Running would be a
> > candidate, but since the remote resource is not necessarily
> > used anymore, that might be confusing.  Should these stages
> > be added?  However, they also do not appear in the DRMAA
> > specification AFAIK.
> > 
> > Any thoughts?
> > 
> These states can be added.
> 
> Also, there is a more complicated state model for "activities" emerging from
> the OGSA-BES work, that also includes sub-states for file staging, etc, etc.
> We can perhaps incorporate some of that as well, although I'm happy with
> general Pre-execution and Post-execution states to cover all of this.

Pre-Execution/Post-Execution sounds good to me.  I guess we
don't want an overly complex state model, and these two
can cover whatever SAGA or the backend deems necessary
to do before/after the job is actually running...
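
A rough sketch of how the two coarse states could absorb
backend-specific phases such as staging.  This is only an
illustration: Python is used for brevity, the backend phase names
and the coarse_state helper are made up here, and only the states
named in this thread are listed because the full list was chopped
above.

```python
from enum import Enum


class JobState(Enum):
    # Only the states mentioned in this thread; the full SAGA list
    # was elided in the quoted mail and is deliberately not guessed.
    PRE_EXECUTION = "PreExecution"
    RUNNING = "Running"
    POST_EXECUTION = "PostExecution"
    DONE_FAIL = "DoneFail"


# Hypothetical backend phase names, purely for illustration.
_BACKEND_PHASE_TO_STATE = {
    "stage_in": JobState.PRE_EXECUTION,
    "executing": JobState.RUNNING,
    "stage_out": JobState.POST_EXECUTION,
}


def coarse_state(backend_phase: str) -> JobState:
    """Map a fine-grained backend phase onto the coarse job state."""
    return _BACKEND_PHASE_TO_STATE[backend_phase]
```

With such a mapping the state model stays small, and a backend
with richer sub-states (as in the OGSA-BES work) only has to decide
which of the coarse states each sub-state belongs to.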

> Perhaps we can discuss on the call tomorrow.

Great.

> > Another question: Assume I check a job status and find it
> > 'DoneFail' - how can I determine the reason for the failure?  It
> > would be useful to know the state the job was in before it
> > failed (e.g. if it was prestaging, I know then that staging
> > failed, and the job never really started).  Also it would be
> > nice to be able to query for any error message.
> >
> 
> There is the getJobExitStatus method on the Job interface so that you can
> get things like the exit code and the signal number that caused termination.
> 
> As for querying the state which preceded the failure, it sounds like a good
> idea (LSF does this by keeping a history log for jobs that can be queried
> via a "bhist" command). Perhaps adding an optional string to the
> JobExitStatus class would be sufficient for this kind of extended
> information? The problem is that this stuff is not particularly standardized
> across resource managers I think.

I think a (potentially) extensive error message on the exit
status object is the simplest solution - if a job failed,
look there to find some information about the reason, if
available.  Nice.
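
To make that concrete, here is a minimal sketch of an exit status
object carrying an optional message.  The field names and the
describe_failure helper are assumptions for illustration only (the
thread only establishes that getJobExitStatus exposes an exit code
and a signal number, and that an optional string was proposed);
Python is used for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JobExitStatus:
    # Hypothetical field names; every field is optional because not
    # every resource manager reports every piece of information.
    exit_code: Optional[int] = None
    signal: Optional[int] = None
    message: Optional[str] = None  # free-form, backend-specific text


def describe_failure(status: JobExitStatus) -> str:
    """Collect whatever failure details the backend made available."""
    parts = []
    if status.exit_code is not None:
        parts.append(f"exit code {status.exit_code}")
    if status.signal is not None:
        parts.append(f"signal {status.signal}")
    if status.message:
        parts.append(status.message)
    return "; ".join(parts) if parts else "no failure details available"
```

A job found in DoneFail would then be inspected via its exit
status rather than via an exception, and the free-form message
could also name the state the job was in when it failed, where the
backend knows it.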

Cheers, Andre.

> > I think that the error query is distinct from the exception
> > mechanism we will have: a job entering DoneFail should NOT
> > throw an exception in my opinion - but that leads to the above
> > question: how can I query the error leading to the DoneFail
> > state?
>
> I agree.
> 
> -- Chris

-- 
+-----------------------------------------------------------------+
| Andre Merzky                      | phon: +31 - 20 - 598 - 7759 |
| Vrije Universiteit Amsterdam (VU) | fax : +31 - 20 - 598 - 7653 |
| Dept. of Computer Science         | mail: merzky at cs.vu.nl       |
| De Boelelaan 1083a                | www:  http://www.merzky.net |
| 1081 HV Amsterdam, Netherlands    |                             |
+-----------------------------------------------------------------+