[DRMAA-WG] DRMAA2: Job State Model Discussion

Piotr Domagalski piotr.domagalski at fedstage.com
Mon Jan 19 15:23:45 CST 2009


On Sun, Jan 18, 2009 at 9:18 PM, Peter Tröger <peter at troeger.eu> wrote:
> DRMAA1 allows the following job states:
> UNDETERMINED
> QUEUED_ACTIVE
> SYSTEM_ON_HOLD
> USER_ON_HOLD
> USER_SYSTEM_ON_HOLD
> RUNNING
> SYSTEM_SUSPENDED
> USER_SUSPENDED
> USER_SYSTEM_SUSPENDED
> DONE
> FAILED
> [...]
> Which states need to be removed / added from the viewpoint of PBS,
> LSF, Condor, SGE, Unicore, GridWay, OGSA-BES and SAGA ?

Some DRMSs don't have a suspend or a hold state -- I don't remember
which ones exactly (Condor?). What is important, in my opinion, is to
allow drmaa_control() to return some kind of "not implemented" error.
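
To illustrate the idea, here is a minimal sketch in Python rather than
the C binding. All names in it (NotImplementedByDrms, SUPPORTED_ACTIONS,
control, and the two DRMS labels) are hypothetical and not part of any
DRMAA specification or implementation; the capability table is made up
for the example.

```python
class NotImplementedByDrms(Exception):
    """Hypothetical error: the DRMS has no equivalent of the requested action."""


# Hypothetical capability table. Which actions a real DRMS supports is an
# assumption made only for this example.
SUPPORTED_ACTIONS = {
    "full_drms": {"suspend", "resume", "hold", "release", "terminate"},
    "minimal_drms": {"terminate"},
}


def control(drms, job_id, action):
    """Apply `action` to a job, or raise if the DRMS cannot express it."""
    if action not in SUPPORTED_ACTIONS[drms]:
        raise NotImplementedByDrms(f"{drms} does not implement '{action}'")
    # A real implementation would talk to the DRMS here.
    return f"{action} sent to job {job_id}"
```

A caller can then tell "this DRMS cannot hold jobs" apart from a
genuine failure of the request by catching that one error type.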

> Which of the job states were never realized in DRMAA implementations ?
> Can we remove them ?

I think the differentiation between system and user hold/suspend is
very much SGE-specific, as far as I remember.

Personally, I also hate the UNDETERMINED state. If it were up to me,
I'd surely remove it. Not being able to get the job state is an error
in most cases, so having a special state for that is useless. For
example, in our case, when implementing BES on top of DRMAA, I loop a
few times when I get the UNDETERMINED state and eventually throw a
fault. I'd rather have the DRMAA implementation loop a few times and,
if that gives nothing, return an error.
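
Roughly, the behaviour I would like to see inside the implementation
looks like the sketch below (Python for brevity). The names
job_state_with_retry, JobStateUnavailable and query_state, as well as
the default of three attempts, are assumptions for illustration only.

```python
import time


class JobStateUnavailable(Exception):
    """Hypothetical error raised when the job state cannot be obtained."""


def job_state_with_retry(query_state, job_id, attempts=3, delay=0.0):
    """Call `query_state(job_id)` up to `attempts` times.

    Returns the first state that is not UNDETERMINED, and raises
    JobStateUnavailable if every attempt yields UNDETERMINED.
    """
    for attempt in range(attempts):
        state = query_state(job_id)
        if state != "UNDETERMINED":
            return state
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before asking the DRMS again
    raise JobStateUnavailable(
        f"state of job {job_id} still unknown after {attempts} attempts"
    )
```

With this in place the caller never sees UNDETERMINED: it gets either
a real state or an error.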

> Which of the states are too generic ? How do we resolve this ?

I don't like the FAILED state meaning both failed and terminated. It
could be split into FAILED and TERMINATED states. But then we need to
discuss how this maps to what drmaa_wait() returns. By the way, did
you have an opportunity to discuss the future of the drmaa_wif*()
functions?
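
As a starting point for that discussion, one possible mapping could
look like the sketch below. The four inputs stand for the values
reported by drmaa_wifexited(), drmaa_wexitstatus(), drmaa_wifsignaled()
and drmaa_wifaborted(); the mapping itself, and the names FinalState
and classify, are only my assumptions and not anything agreed upon.

```python
from enum import Enum


class FinalState(Enum):
    DONE = "DONE"
    FAILED = "FAILED"
    TERMINATED = "TERMINATED"


def classify(exited, exit_status, signaled, aborted):
    """Map wait-style flags to a final state (one possible mapping)."""
    # Assumption: a job killed by a signal, or one that never ran,
    # counts as terminated rather than failed.
    if aborted or signaled:
        return FinalState.TERMINATED
    # Assumption: only a normal exit with status 0 counts as success.
    if exited and exit_status == 0:
        return FinalState.DONE
    return FinalState.FAILED
```

Whether a non-zero exit status should really mean FAILED is exactly
the kind of thing that would need to be settled.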

> Do we want an extensible job state model as in OGSA-BES ? If yes, how
> to realize ?

I think this would be cool to have. DRMAA is mostly viewed as a
low-level API to a local DRMS. For those scenarios, the current state
model is OK. On the other hand, there are scenarios where DRMAA is used
to add a simple API on top of some higher-level system -- as in GridWay,
AFAIR. We were actually also thinking of implementing a DRMAA interface
to a remote OGSA-BES service. For that kind of scenario, an
extensible job state model (e.g. for stage in/out states, which are
rarely observable in a local DRMS) would be useful.
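
Just to show what I have in mind, and not as a proposal, an extensible
state could be a fixed base state plus an optional sub-state, in the
spirit of OGSA-BES. Everything in this sketch (JobState, BASE_STATES,
is_running, the STAGE_IN sub-state name) is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: a reduced set of base states, chosen only for the example.
BASE_STATES = {"QUEUED_ACTIVE", "RUNNING", "DONE", "FAILED"}


@dataclass(frozen=True)
class JobState:
    base: str                  # one of the fixed base states
    sub: Optional[str] = None  # optional, implementation-defined sub-state

    def __post_init__(self):
        if self.base not in BASE_STATES:
            raise ValueError(f"unknown base state: {self.base}")

    def __str__(self):
        return self.base if self.sub is None else f"{self.base}:{self.sub}"


def is_running(state):
    """Clients that ignore sub-states can still rely on the base state."""
    return state.base == "RUNNING"
```

A client that knows nothing about staging would treat RUNNING:STAGE_IN
simply as RUNNING, so existing code keeps working.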

If there are many votes for that, we might start a discussion on how
to actually specify/implement it. Until then, I don't think there's
any use going into the details.

-- 
Piotr Domagalski
FedStage Systems Ltd.