[SAGA-RG] SAGA Job State Model

Tue Mar 23 17:34:06 CDT 2010

Quoting [Ole Christian Weidner] (Mar 22 2010):
> 
> On Mar 22, 2010, at 8:25 AM, Andre Merzky wrote:
> 
> > Quoting [Ole Christian Weidner] (Mar 22 2010):
> >> 
> >> Aloha,
> >> 
> >> what was the reason again *not* to have a "pending" state in the
> >> saga job model? 
> > 
> > The decision on what states are on the top level of the SAGA state
> > model was based on the operations available in the API: only those
> > states got added which were explicitly reachable via some API
> > method.  
> 
> Ok, but why? This is IMHO a pretty random decision. 

Might as well be, true - some decision had to be made though, and
that seems as good a guideline as any other.

> > A 'Pending' state cannot be reached (or left, depending on
> > semantics) by any SAGA API call, thus that is only available as
> > state detail, not as top level state.
> 
> What about saga::job::New --run()--> saga::job::Pending. Also, you
> could say the same thing about Done and Failed: these states are
> not explicitly reachable via a call... and wait() doesn't really
> count! if it does, you could also use it to transition from
> Pending to Running, Failed, etc. 

New->run()->Pending:  Sure, but then you have a transition
Pending->Running which is not expressed at API level.

Done/Failed: correct of course, but yes, we counted wait() to be the
point where the application can sync with the job state.

> >> I'm implementing the third job adaptor (gLite
> >> CREAM) for saga and again, I don't know if I should map gLite's
> >> "pending" state to saga::job::New or saga::job::Running.
> > 
> > It should go to Running (as almost all substates IMHO).  New is
> > usually defined so that a job does not yet have a backend
> > representation.  
> 
> Usually? 

Yes: most states in middleware systems are assigned to jobs which
have a backend representation, and are thus in Running state from
the SAGA point of view.  Exceptions I can think of are substates of
Suspended (UserSuspended, SystemSuspended etc), and substates of the
final states (UserFailed, ApplicationFailed, SystemFailed,
UserCanceled, SystemCanceled etc).  Most other states we encountered
and which are specified for the various systems describe details of
a live job (after being accepted by the backend, before being
suspended or finished), and can thus be mapped to Running.
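
As a rough illustration of that rule of thumb, the mapping could be
sketched as below.  This is only a sketch under assumptions: the
backend state names are the examples from this mail rather than any
real middleware's list, map_state() is a hypothetical helper and not
part of any SAGA implementation, and defaulting unknown states to
Running is an assumption that mirrors the "most other states"
remark above.

```python
# Top-level SAGA job states as listed in GFD.90.
SAGA_STATES = {"New", "Running", "Suspended", "Done", "Failed", "Canceled"}

# Hypothetical backend state names (taken from the examples in this
# mail, not from a real middleware) mapped to a SAGA top-level state.
BACKEND_TO_SAGA = {
    "Pending": "Running",
    "StagingIn": "Running",
    "UserSuspended": "Suspended",
    "SystemSuspended": "Suspended",
    "UserFailed": "Failed",
    "SystemFailed": "Failed",
    "UserCanceled": "Canceled",
}


def map_state(model, backend_state):
    """Return (saga_state, state_detail) for a backend state.

    Assumption: states not listed above describe a live job and are
    therefore reported as Running.
    """
    saga_state = BACKEND_TO_SAGA.get(backend_state, "Running")
    return saga_state, f"{model}:{backend_state}"
```

With this sketch, map_state("gLite", "Pending") yields the top-level
state Running together with the state detail 'gLite:Pending'.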

> > In Pending states however, most middleware do
> > already maintain job state.
> 
> What do you mean by maintaining a job state? 

A better way to express this may be to say: the job has a
representation in the backend.  I.e., the backend accepted the job
creation request and a job-id exists which uniquely identifies the
job.

> >> Most of the middleware API's out there come with a plethora of
> >> states (e.g. gLite: 11), but most of them map naturally to
> >> one of the saga job model's states. "Pending" is a state pretty
> >> much used by everyone (Condor, PBS, LSF, Globus, gLite,
> >> GridSAM) and it really doesn't map to saga's model. IMHO it's a
> >> major design flaw - how could this fall through the sieve? Or
> >> is there a  reason behind this?
> > 
> > See above.  As you say, there is a plethora of states, and many
> > are important for specific use cases.  Other states have been
> > candidates for SAGA, such as StageIn and StageOut, or Hold, for
> > all of which exist interesting use cases.  But again: it did not
> > seem very useful to expose states on the top level which cannot
> > be reached via API calls - they are then only useful for
> > informational purposes.  As such, they are still available in
> > the state details.
> 
> But again: why didn't it seem very useful? ;-) 
> 
> I would be perfectly happy using the state detail. The only
> problem with them is that they're absolutely useless without any
> formalization. Do you think it would make sense to define an
> extended state model (on implementation level) for the state
> details? This is IMHO the only way to make use of it
> programmatically. 

The state detail format is specified in GFD.90, as

  State details in SAGA SHOULD be formatted as follows: 
    '<model>:<state>' 
  with valid models being "BES", "DRMAA", or other implementation
  specific models. For example, a state detail for the BES state
  'StagingIn' would be rendered as 'BES:StagingIn'), and would be a
  substate of Running. If no state details are available, the metric
  is still available, but it has always an empty string value. 

So, 'gLite:Pending' would be what you are looking for, and it should
be possible for the application to interpret it (it needs to have a
notion of what 'Pending' means, and needs to look at the second part
of the state detail).

The only more convenient way to expose the state detail I could
think of would be to expose the state detail components
individually:

  state_detail_model = gLite
  state_detail_value = Pending
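
On the application side, splitting the '<model>:<state>' string into
those two components is a small step.  The following is a minimal
sketch, assuming the format quoted from GFD.90 above;
split_state_detail() is a hypothetical helper name, and treating a
string without a model prefix as a bare state value is an assumption,
not something the spec prescribes.

```python
def split_state_detail(detail):
    """Split '<model>:<state>' into (model, state).

    An empty state detail (no details available) gives ('', '').
    """
    if not detail:
        return "", ""
    model, sep, value = detail.partition(":")
    if not sep:
        # Assumption: no model prefix, so treat the whole string as
        # the state value.
        return "", detail
    return model, value


model, value = split_state_detail("gLite:Pending")
is_pending = (value == "Pending")
```

The application still needs its own notion of what 'Pending' means
for the given model; the split only makes the second part easy to
look at.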

> > Also, as a last point: the more states we add to SAGA, the more
> > difficult it is to map to a specific backend state model (DRMAA,
> > AWS, local, ssh and BES come to my mind which do not have a
> > Pending, for example).
> 
> I don't think that this is a valid point. Why does it become more
> difficult? Especially if we're talking about a state that cannot
> be reached explicitly: you don't have to worry about it at all. If
> SSH doesn't have a "PENDING" state, it will simply never reach it! 

The state model does get more complicated, as you still need to allow
direct state transitions from New to Running to cater for those
backends.

For example, we initially considered using the DRMAA.v1 state model,
as that was the state of the art at that point in time (a long time
ago).  DRMAA has the following states:

 UNDETERMINED, 
 QUEUED_ACTIVE, 
 SYSTEM_ON_HOLD, 
 USER_ON_HOLD, 
 USER_SYSTEM_ON_HOLD, 
 RUNNING, 
 SYSTEM_SUSPENDED, 
 USER_SUSPENDED, 
 USER_SYSTEM_SUSPENDED, 
 DONE, 
 FAILED

It turned out to be hard to map the Globus or gLite states to that
model without ending up with insanely complex state mapping rules.
Thus we went for the simplest state model possible.
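
To make the cost of an extra top-level state a bit more concrete,
here is a small sketch that compares transition tables with and
without a Pending state.  The tables are simplified assumptions for
illustration only; they are not copied from GFD.90, and the
transitions out of the hypothetical Pending state are guesses.

```python
# Simplified transition table, loosely following the SAGA job model
# (an assumption for illustration, not the normative state diagram).
SIMPLE = {
    "New": {"Running"},
    "Running": {"Suspended", "Done", "Failed", "Canceled"},
    "Suspended": {"Running", "Canceled"},
}


def with_pending(transitions):
    """Return a copy with a hypothetical top-level Pending state."""
    extended = {state: set(targets) for state, targets in transitions.items()}
    # New -> Pending is added; New -> Running has to stay for
    # backends which have no Pending state at all.
    extended["New"].add("Pending")
    extended["Pending"] = {"Running", "Failed", "Canceled"}
    return extended


def count_edges(transitions):
    """Count the transitions an adaptor has to be able to handle."""
    return sum(len(targets) for targets in transitions.values())


EXTENDED = with_pending(SIMPLE)
```

In this toy version the number of transitions grows from 7 to 11,
and New now has two possible successors depending on the backend.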

Let me turn the question around: what exactly is the use case you
need the Pending state for, and why can't that be solved with the
state_detail?

Finally: if you and others strongly feel that the SAGA state model is
too simple, or the state detail is not accessible enough, we should
certainly reopen the discussion on how those are rendered in the
API.  I doubt that it would be prudent to just change our
implementation though, w/o revising the spec first.

Cheers, Andre.

-- 
Nothing is ever easy.