[Pgi-wg] OGF PGI - Job State Model - Execution Service Strawman

Morris Riedel m.riedel at fz-juelich.de
Fri Aug 14 07:16:58 CDT 2009


Dear Etienne,

>-For all future OGF PGI telephone conferences, is it possible that a
>-secretary or a chair takes meeting notes, then writes them down in an
>-understandable form, and publishes them at the above-mentioned page?

Johannes did a good job and took notes at the last meeting. He will put
them on GridForge soon and highlight the most important points in today's
telcon.


>-Strawman Rendering
>-------------------
>-I will work on the ODT version of 'Strawman Rendering' at
>-http://forge.gridforum.org/sf/go/doc15628?nav=1 in order to:
>-
>--  include the above clarifications on states,
>-
>--  include the 'Types of grid Jobs' section of my 'PGI Execution Service
>-Overview' document,

Nice - please don't forget to track changes.

Take care,
Morris



------------------------------------------------------------
Morris Riedel
SW - Engineer
Distributed Systems and Grid Computing Division
Jülich Supercomputing Centre (JSC)
Forschungszentrum Juelich
Wilhelm-Johnen-Str. 1
D - 52425 Juelich
Germany

Email: m.riedel at fz-juelich.de
Info: http://www.fz-juelich.de/jsc/JSCPeople/riedel
Phone: +49 2461 61 - 3651
Fax: +49 2461 61 - 6656

Skype: MorrisRiedel

"We work to better ourselves, and the rest of humanity"

Registered office: Jülich
Registered in the Commercial Register of the Düren District Court, No. HR B 3498
Chair of the Supervisory Board: MinDirig'in Bärbel Brumme-Bothe
Executive Board: Prof. Dr. Achim Bachem (Chair),
Dr. Ulrich Krafft (Deputy Chair)


>------Original Message-----
>-From: Etienne URBAH [mailto:urbah at lal.in2p3.fr]
>-Sent: Thursday, August 13, 2009 11:50 PM
>-To: balazs.konya at hep.lu.se; Riedel, Morris; pgi-wg at ogf.org
>-Cc: lodygens at lal.in2p3.fr; edges-na3 at mail.edges-grid.eu
>-Subject: OGF PGI - Job State Model - Execution Service Strawman
>-
>-Balazs, Morris and all,
>-
>-
>-Concerning the last OGF PGI telephone conference on 05 August 2009:
>-
>-
>-Meeting notes
>--------------
>-I see NO meeting notes about this telephone conference at
>-http://forge.gridforum.org/sf/discussion/do/listTopics/projects.pgi-wg/discussion.meetings
>-
>-So I am working with my own (fragmentary) notes.
>-
>-For all future OGF PGI telephone conferences, is it possible that a
>-secretary or a chair takes meeting notes, then writes them down in an
>-understandable form, and publishes them at the above-mentioned page?
>-
>-
>-Creation of a 'Submitted:Hold' substate?
>-----------------------------------------
>-First, as general rules, I consider that:
>-
>--  In order to AVOID keeping (potentially large) grid resources while
>-NOT computing, grid Jobs should be designed to be processed completely
>-automatically, with NO provision for 'Hold' substates,
>-
>--  A grid Job needing many 'Hold' substates can NOT be handled by an
>-automatic Submitter, but should be submitted by a human grid User as an
>-'Interactive Job', as described for example at
>-https://edms.cern.ch/file/722398//gLite-3-UserGuide.html#SECTION00084400000000000000
>-
>-
>-
>-Someone asked for the creation of a 'Hold' substate inside the
>-'Submitted' state, as exists inside other states.
>-
>-This 'Submitted:Hold' substate would make sense only if the Job
>-Submitter could perform an operation on this substate.
>-
>-In order to request such an operation, the Job Submitter needs the Jobid
>-(or Job EPR).
>-
>-This Jobid (or Job EPR) is guaranteed to be allocated by the Execution
>-Service only at the END of the 'Submitted' state, but NOT before.
>-
>-Therefore, I consider that the 'Submitted' state can NOT contain a
>-'Hold' substate.
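A minimal Python sketch of the argument above, under stated assumptions. The class and method names (ExecutionService, submit, state_of) are hypothetical and not part of any PGI or OGSA-BES interface, and 'Pre-processing' is assumed to be the state that directly follows 'Submitted'. The only rule taken from the text is that the Jobid is allocated at the END of 'Submitted', so any Jobid the Submitter can hold already refers to a Job past that state.

```python
import itertools
from enum import Enum


class State(Enum):
    SUBMITTED = "Submitted"
    PRE_PROCESSING = "Pre-processing"  # assumed to follow 'Submitted'


class ExecutionService:
    """Hypothetical Execution Service: a Jobid exists only after 'Submitted' ends."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._states = {}  # Jobid -> current State of the Job

    def submit(self, job_description):
        # While this call is in progress the Job is in 'Submitted' and has
        # no Jobid yet, so the Submitter cannot name it in any operation.
        job_id = "job-%d" % next(self._next_id)
        # The Jobid is allocated at the END of 'Submitted': from now on the
        # Job is in the next state of the model.
        self._states[job_id] = State.PRE_PROCESSING
        return job_id

    def state_of(self, job_id):
        # Any operation (e.g. a hypothetical 'hold') needs a Jobid first.
        return self._states[job_id]


service = ExecutionService()
job_id = service.submit("hypothetical job description")
# The first state a Submitter can ever observe or act on is past 'Submitted'.
print(service.state_of(job_id))  # State.PRE_PROCESSING
```

In this sketch a 'Submitted:Hold' substate would be unreachable by construction: no Jobid can be passed to an operation while the Job is still in 'Submitted'.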
>-
>-If anyone thinks otherwise, can he/she please present a convincing use
>-case?
>-
>-
>-Clarifications about the 'Finished with Success or Error' state
>-----------------------------------------------------------------
>-Someone asked that the 'Error' case of the 'Finished with Success or
>-Error' state should be moved to the 'Failed' state.
>-
>-In fact, in the current Job State Model, a Job reaches the 'Finished
>-with Success or Error' state if and only if it has successively reached
>-the end of the following states, without failure or cancellation at the
>-JOB level:
>--  'Pre-processing'
>--  'Delegated', whatever the Application result :
>-    - Success = Application return code equal to zero
>-    - Error   = Application return code not equal to zero
>--  'Post-processing'
>-
>-Inside the 'Finished with Success or Error' state:
>--  Success means 'Application return code was equal to zero',
>--  Error   means 'Application return code was not equal to zero'.
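The rule above can be sketched as a small Python function. This is an illustration under assumptions, not part of the Strawman: the names FinalState and final_state are hypothetical, each boolean stands for "this state reached its end without failure at the JOB level", and cancellation is left out of the sketch.

```python
from enum import Enum
from typing import Optional


class FinalState(Enum):
    FINISHED_SUCCESS = "Finished with Success"
    FINISHED_ERROR = "Finished with Error"
    FAILED = "Failed"


def final_state(pre_ok: bool, delegated_ok: bool, post_ok: bool,
                exit_code: Optional[int]) -> FinalState:
    # A failure at the JOB level in any of the three successive states
    # leads to 'Failed', whatever the Application itself returned.
    if not (pre_ok and delegated_ok and post_ok):
        return FinalState.FAILED
    if exit_code is None:
        raise ValueError("'Delegated' ended without an Application return code")
    # All three states reached their end: the return code only selects the
    # Success or Error case INSIDE 'Finished with Success or Error'.
    return FinalState.FINISHED_SUCCESS if exit_code == 0 else FinalState.FINISHED_ERROR


print(final_state(True, True, True, 0).value)   # Finished with Success
print(final_state(True, True, True, 2).value)   # Finished with Error
print(final_state(True, True, False, 0).value)  # Failed
```

The point of the design is visible in the order of the checks: JOB-level consistency is decided first, and the Application return code never moves a Job into 'Failed'.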
>-
>-I copied this behavior from the Job State Model of 'gLite', where the
>-'Done' state contains both the 'Success' and 'Exit Code !=0' cases, as
>-can be seen in the 'bookkeeping information' at
>-https://edms.cern.ch/file/722398//gLite-3-UserGuide.html#SECTION00084100000000000000
>-
>-
>-I consider this design, and the strong separation between the
>-'Failed' and 'Finished with Success or Error' states, fully justified
>-for the following reasons:
>-
>--  Whenever a Job reaches the 'Failed' state, the grid Execution Service
>-has detected an unrecoverable inconsistency at the JOB level.
>-    Therefore, the Job output sandbox and the post-processed Application
>-output files may NOT be consistent, and may NOT even be accessible to
>-the Job Submitter.
>-    To investigate the Job failure, the grid User then needs some grid
>-knowledge (and often experience and expertise) to retrieve and
>-interpret:
>-    - the Job failure code and message,
>-    - the Job logging and bookkeeping, in comparison with the Job
>-description.
>-    This 'grid level' investigation sometimes shows that the Job failure
>-was caused by the Application, but it is ALWAYS necessary.
>-
>--  Whenever a Job reaches the 'Finished with Success or Error' state,
>-the grid Execution Service was able to create the Job output sandbox,
>-and to perform post-processing on Application output files, WITHOUT
>-detecting any unrecoverable inconsistency at the JOB level.
>-    Therefore, the Job output sandbox and the post-processed
>-Application output files can be assumed to be consistent and easily
>-accessible to the Job Submitter.
>-    When the Application returns a non-zero code, the grid User:
>-    - first has to look (WITHOUT needing any grid knowledge) at the Job
>-output sandbox and at the post-processed Application output files for an
>-Application problem,
>-    - then, only if necessary, uses grid knowledge (and often experience
>-and expertise) to find evidence that the Application error was caused
>-by a faulty Job description, the Batch system, or the grid Execution
>-Service.
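The two investigation paths described above can be summarized in a short Python sketch. The function name investigation_steps and the step wording are hypothetical; only the ordering (grid-level investigation always needed after 'Failed', Application outputs checked first after a non-zero return code) comes from the reasoning above.

```python
from typing import List, Optional

FAILED = "Failed"
FINISHED = "Finished with Success or Error"


def investigation_steps(state: str, exit_code: Optional[int]) -> List[str]:
    """Order of investigation for a grid User, following the reasoning above."""
    if state == FAILED:
        # Outputs may be inconsistent or inaccessible: a grid-level
        # investigation is ALWAYS necessary.
        return [
            "retrieve and interpret the Job failure code and message",
            "compare the Job logging and bookkeeping with the Job description",
        ]
    if state != FINISHED:
        raise ValueError("not a terminal state handled here: %r" % state)
    if exit_code == 0:
        return []  # Success: nothing to investigate
    # Outputs can be assumed consistent: look at the Application first,
    # and use grid knowledge only if that is not enough.
    return [
        "inspect the Job output sandbox and post-processed Application output files",
        "if necessary, look for evidence against the Job description, "
        "the Batch system, or the grid Execution Service",
    ]


for step in investigation_steps(FINISHED, 1):
    print("-", step)
```

Keeping the two states separate is what lets the first branch be skipped entirely for the common case of an Application error.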
>-
>-In summary, I consider that the 'Error' case of the 'Finished with
>-Success or Error' state should be kept as it is, and NOT be moved to the
>-'Failed' state.
>-
>-If anyone thinks otherwise, can he/she please present convincing reasons?
>-
>-
>-Strawman Rendering
>-------------------
>-I will work on the ODT version of 'Strawman Rendering' at
>-http://forge.gridforum.org/sf/go/doc15628?nav=1 in order to:
>-
>--  include the above clarifications on states,
>-
>--  include the 'Types of grid Jobs' section of my 'PGI Execution Service
>-Overview' document,
>-
>--  check consistency, and present the relationships between the
>-operations described in chapter 2 'Interface: Execution Port-Type' and
>-the different states of the different types of grid Jobs.
>-
>-
>-I will join +9900827049931906 (and perhaps Skype chat) on Friday 14
>-August 2009 at 16h CET.
>-
>-Best regards.
>-
>------------------------------------------------------
>-Etienne URBAH         LAL, Univ Paris-Sud, IN2P3/CNRS
>-                       Bat 200   91898 ORSAY    France
>-Tel: +33 1 64 46 84 87      Skype: etienne.urbah
>-Mob: +33 6 22 30 53 27      mailto:urbah at lal.in2p3.fr
>------------------------------------------------------

