[Pgi-wg] OGF PGI - Job State Model - Execution Service Strawman

Etienne URBAH urbah at lal.in2p3.fr
Thu Aug 13 16:49:56 CDT 2009


Balazs, Morris and all,


Concerning the last OGF PGI telephone conference on 05 August 2009 :


Meeting notes
-------------
I see NO meeting notes about this telephone conference at 
http://forge.gridforum.org/sf/discussion/do/listTopics/projects.pgi-wg/discussion.meetings

So I am working with my own (fragmentary) notes.

For all future OGF PGI telephone conferences, is it possible that a 
secretary or a chair takes meeting notes, then writes them down in an 
understandable form, and publishes them at the above-mentioned page ?


Creation of a 'Submitted:Hold' substate ?
-----------------------------------------
First, as general rules, I consider that :

-  In order to AVOID keeping (potentially large) grid resources while 
NOT computing, grid Jobs should be designed to be processed completely 
automatically, with NO provision for 'Hold' substates,

-  A grid Job needing many 'Hold' substates can NOT be handled by an 
automatic Submitter, but should be submitted by a human grid User as an 
'Interactive Job', as described for example at 
https://edms.cern.ch/file/722398//gLite-3-UserGuide.html#SECTION00084400000000000000 



Someone asked for the creation of a 'Hold' substate inside the 
'Submitted' state, like inside other states.

This 'Submitted:Hold' substate would make sense only if the Job 
Submitter could perform an operation on this substate.

In order to request such an operation, the Job Submitter needs the Jobid 
(or Job EPR).

This Jobid (or Job EPR) is guaranteed to be allocated by the Execution 
Service only at the END of the 'Submitted' state, but NOT before.

Therefore, I consider that the 'Submitted' state can NOT contain a 
'Hold' substate.
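
As an illustration of this argument, here is a minimal sketch in Python. 
The class, the method names and the Jobid format are hypothetical and 
are NOT taken from any PGI document. I also assume, only for this 
sketch, that 'Pre-processing' is the state entered when 'Submitted' 
ends. The point is that a 'hold' operation needs a Jobid as argument, 
and that the Submitter receives this Jobid only when 'submit' returns :

```python
from typing import Dict


class ToyExecutionService:
    """Toy model: the Jobid exists only once 'Submitted' has ended."""

    def __init__(self) -> None:
        self._states: Dict[str, str] = {}  # Jobid -> current state
        self._counter = 0

    def submit(self, description: str) -> str:
        # While this call is running, the Job is in the 'Submitted'
        # state and the Submitter has no Jobid to refer to it.
        self._counter += 1
        job_id = f"job-{self._counter}"  # hypothetical Jobid format
        # Assumption: 'Pre-processing' follows 'Submitted'.
        self._states[job_id] = "Pre-processing"
        return job_id

    def state(self, job_id: str) -> str:
        return self._states[job_id]

    def hold(self, job_id: str) -> str:
        # Every operation on a Job needs its Jobid, so it can only be
        # requested AFTER the end of the 'Submitted' state.
        current = self._states[job_id]  # KeyError for an unknown Jobid
        if not current.endswith(":Hold"):
            self._states[job_id] = current + ":Hold"
        return self._states[job_id]


if __name__ == "__main__":
    service = ToyExecutionService()
    jid = service.submit("my job description")
    print(jid, service.state(jid))
    print(service.hold(jid))
```

In this sketch, the Submitter never observes a Job in the 'Submitted' 
state together with a usable Jobid, so there is nothing on which a 
'Submitted:Hold' operation could be requested.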

If anyone thinks otherwise, can he/she please present a convincing Use 
Case ?


Clarifications about the 'Finished with Success or Error' state
---------------------------------------------------------------
Someone asked that the 'Error' case of the 'Finished with Success or 
Error' state should be moved to the 'Failed' state.

In fact, inside the current Job State Model, a Job reaches the 'Finished 
with Success or Error' state if and only if it successively reached the 
end of the following states, without failure or cancellation at the JOB 
level :
-  'Pre-processing'
-  'Delegated', whatever the Application result :
    - Success = Application return code equal       to zero
    - Error   = Application return code different from zero
-  'Post-processing'

Inside the 'Finished with Success or Error' state :
-  Success means 'Application return code was equal       to zero',
-  Error   means 'Application return code was different from zero'.
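
The rule above can be sketched as follows in Python. This is only an 
illustration : the names 'FinalState' and 'classify' are hypothetical 
and belong to no PGI specification, and cancellation is left out for 
brevity. The final state depends ONLY on failure at the JOB level, 
while the Application return code only selects the case inside the 
'Finished with Success or Error' state :

```python
from enum import Enum
from typing import Optional, Tuple


class FinalState(Enum):
    FAILED = "Failed"
    FINISHED = "Finished with Success or Error"


def classify(job_level_failure: bool,
             app_return_code: Optional[int]
             ) -> Tuple[FinalState, Optional[str]]:
    """Return (final state, case inside that state)."""
    if job_level_failure:
        # Failure detected by the Execution Service at the JOB level:
        # the Application return code, if any, is not considered.
        return FinalState.FAILED, None
    if app_return_code is None:
        raise ValueError("no Application return code available")
    # No JOB-level failure: the return code only selects the case.
    case = "Success" if app_return_code == 0 else "Error"
    return FinalState.FINISHED, case


if __name__ == "__main__":
    for failure, code in [(False, 0), (False, 3), (True, None)]:
        print(failure, code, classify(failure, code))
```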

I copied this behavior from the Job State Model of 'gLite', where the 
'Done' state contains both the 'Success' and 'Exit Code !=0' cases, as 
can be seen in the 'bookkeeping information' at 
https://edms.cern.ch/file/722398//gLite-3-UserGuide.html#SECTION00084100000000000000


I consider this design, and the strong separation between the 
'Failed' and 'Finished with Success or Error' states, as fully justified 
by the following reasons :

-  Whenever a Job reaches the 'Failed' state, the grid Execution Service 
detected an unrecoverable inconsistency at the JOB level.
    Therefore, the Job output sandbox and the post-processed Application 
output files may be inconsistent, and may not even be accessible 
by the Job Submitter.
    In order to investigate the Job failure, the grid User then needs 
some grid knowledge (and often experience and expertise) to retrieve and 
interpret :
    - the Job failure code and message,
    - the Job logging and bookkeeping, in comparison with the Job 
description.
    This 'grid level' investigation is ALWAYS necessary, even though it 
sometimes shows that the Job failure was caused by the Application.

-  Whenever a Job reaches the 'Finished with Success or Error' state, 
the grid Execution Service could create the Job output sandbox, and 
perform post-processing on Application output files, WITHOUT detecting 
any unrecoverable inconsistency at the JOB level.
    Therefore, the Job output sandbox, and the post-processed 
Application output files, can be supposed to be consistent and easily 
accessible by the Job Submitter.
    On a non-zero return code of the Application, the grid User :
    - first has to look (WITHOUT needing any grid knowledge) at the Job 
output sandbox and at the post-processed Application output files for an 
Application problem,
    - and only then, if necessary, has to use grid knowledge (and often 
experience and expertise) to provide evidence that the Application error 
was caused by a faulty Job description, the Batch system, or the grid 
Execution Service.
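
The two investigation paths described above can be summarized by the 
following Python sketch. The function name and the wording of the steps 
are mine and purely illustrative. The sketch only shows that the 
'Failed' state always starts with a grid level investigation, whereas 
an Application error starts with the Application outputs :

```python
from typing import List

FAILED = "Failed"
FINISHED = "Finished with Success or Error"


def investigation_steps(state: str, app_return_code: int = 0) -> List[str]:
    """Ordered steps for the grid User, following the reasoning above."""
    if state == FAILED:
        # Grid knowledge is ALWAYS needed here.
        return [
            "retrieve the Job failure code and message",
            "compare the Job logging and bookkeeping "
            "with the Job description",
        ]
    if state == FINISHED:
        if app_return_code == 0:
            return []  # Success: nothing to investigate
        return [
            # First step: no grid knowledge needed.
            "look at the Job output sandbox and at the "
            "post-processed Application output files",
            # Second step: only if necessary, with grid knowledge.
            "if necessary, check the Job description, the Batch "
            "system and the grid Execution Service",
        ]
    raise ValueError(f"not a final state handled here: {state!r}")


if __name__ == "__main__":
    for step in investigation_steps(FINISHED, app_return_code=1):
        print("-", step)
```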

As a summary, I consider that the 'Error' case of the 'Finished with 
Success or Error' state should be kept as it is, and NOT be moved to the 
'Failed' state.

If anyone thinks otherwise, can he/she please present convincing reasons ?


Strawman Rendering
------------------
I will work on the ODT version of 'Strawman Rendering' at 
http://forge.gridforum.org/sf/go/doc15628?nav=1 in order to :

-  include the above clarifications on states,

-  include the 'Types of grid Jobs' section of my 'PGI Execution Service 
Overview' document,

-  check consistency, and present the relationships between the 
operations described in chapter 2 'Interface: Execution Port-Type' and 
the different states of the different types of grid Jobs.


I will join +9900827049931906 (plus perhaps Skype typing) on Friday 14 
August 2009 at 16h CET.

Best regards.

-----------------------------------------------------
Etienne URBAH         LAL, Univ Paris-Sud, IN2P3/CNRS
                       Bat 200   91898 ORSAY    France
Tel: +33 1 64 46 84 87      Skype: etienne.urbah
Mob: +33 6 22 30 53 27      mailto:urbah at lal.in2p3.fr
-----------------------------------------------------