[Pgi-wg] OGF PGI - AGU Execution Service Strawman Rendering

Etienne URBAH urbah at lal.in2p3.fr
Fri Oct 16 11:53:23 CDT 2009


Balazs, Morris, Luigi, Johannes and all,

Concerning the 'AGU Execution Service Strawman Rendering' of OGF PGI and 
the telephone conference of last week on 09 October 2009 :

-  Many thanks to Morris for having given detailed explanations on 
chapter 2.1 'CreateActivity Operation'.
    I now understand much better what is described inside an 'operation'.

-  Many thanks to Johannes for the Report and for the Action list.


Consistency between the CreateActivity operation and the State Model
--------------------------------------------------------------------
Inside chapter 2.1 'CreateActivity operation', I found discrepancies 
between the current description of the 'CreateActivity' operation and 
the PGI Single Job State Model :

-  Inside the PGI Single Job State Model, the Execution Service :
    - Allocates a Jobid (or an EPR) to the Job and sends it back to the 
Submitter at the end of the 'Submitted' state, BEFORE any storage 
allocation could be performed,
    - Notifies the submitter with allocated storage resources for 
stage-in only inside the 'Pre-processing:Hold' state.

-  The current description of the 'CreateActivity' operation encompasses 
both the 'Submitted' and 'Pre-processing' states, and states that the 
response can contain information about storage resources for stage-in.
    In fact :
    - The 'CreateActivity' operation should be limited to the 
'Submitted' state, and the response can only be a vector of Jobids 
(or EPRs).  Information about storage resources for stage-in can only be 
given later, through a 'GetActivityInfo' request or a notification to 
the submitter.
    - In order to permit notification, the 'CreateActivity' operation 
should allow a 'Notification EPR' as an additional optional input 
parameter.
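
The proposed split can be illustrated with a minimal Python sketch.  It 
is only an illustration under assumptions, NOT the PGI rendering : the 
class and method names, the 'job-N' identifier format and the stage-in 
URL are hypothetical, and the 'Notification EPR' is modelled as a plain 
callback instead of a real endpoint reference.

```python
from dataclasses import dataclass
from itertools import count
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for a 'Notification EPR': a callback taking (job_id, info).
Notifier = Callable[[str, dict], None]


@dataclass
class Activity:
    job_id: str
    state: str = "Submitted"
    notify: Optional[Notifier] = None
    stage_in: Optional[str] = None  # unknown until 'Pre-processing:Hold'


class ExecutionService:
    def __init__(self) -> None:
        self._ids = count(1)
        self.activities: Dict[str, Activity] = {}

    def create_activity(self, descriptions: List[str],
                        notification_epr: Optional[Notifier] = None) -> List[str]:
        """Covers the 'Submitted' state only: returns a vector of Jobids."""
        job_ids = []
        for _ in descriptions:
            job_id = f"job-{next(self._ids)}"  # assumed id format
            self.activities[job_id] = Activity(job_id, notify=notification_epr)
            job_ids.append(job_id)
        return job_ids  # no storage information at this point

    def enter_pre_processing_hold(self, job_id: str, stage_in_url: str) -> None:
        """Later step: storage for stage-in is allocated, submitter is notified."""
        activity = self.activities[job_id]
        activity.state = "Pre-processing:Hold"
        activity.stage_in = stage_in_url
        if activity.notify is not None:
            activity.notify(job_id, self.get_activity_info(job_id))

    def get_activity_info(self, job_id: str) -> dict:
        """Polling alternative to the notification."""
        activity = self.activities[job_id]
        return {"state": activity.state, "stage_in": activity.stage_in}
```

A submitter would call create_activity() once, keep the returned Jobids, 
and learn the stage-in location either through the callback or by 
polling get_activity_info().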

I have updated the document with changes highlighted at 
http://forge.gridforum.org/sf/go/doc15628?nav=1


Hold substate inside the 'Submitted' state ?
--------------------------------------------
See mail below.


Best regards.

-----------------------------------------------------
Etienne URBAH         LAL, Univ Paris-Sud, IN2P3/CNRS
                       Bat 200   91898 ORSAY    France
Tel: +33 1 64 46 84 87      Skype: etienne.urbah
Mob: +33 6 22 30 53 27      mailto:urbah at lal.in2p3.fr
-----------------------------------------------------


On Thu, 13 Aug 2009, Etienne URBAH wrote:
> Balazs, Morris and all,
> 
> 
> Concerning the last OGF PGI telephone conference on 05 August 2009 :
> 
> 
> Meeting notes
> -------------
> I see NO meeting notes about this telephone conference at 
> http://forge.gridforum.org/sf/discussion/do/listTopics/projects.pgi-wg/discussion.meetings 
> 
> 
> So I am working with my own (fragmentary) notes.
> 
> For all future OGF PGI telephone conferences, would it be possible for a 
> secretary or a chair to take meeting notes, write them up in an 
> understandable form, and publish them at the above-mentioned page ?
> 
> 
> Creation of a 'Submitted:Hold' substate ?
> -----------------------------------------
> First, as general rules, I consider that :
> 
> -  In order to AVOID keeping (potentially large) grid resources while 
> NOT computing, grid Jobs should be designed to be processed completely 
> automatically, with NO provision for 'Hold' substates,
> 
> -  A grid Job needing many 'Hold' substates can NOT be handled by an 
> automatic Submitter, but should be submitted by a human grid User as an 
> 'Interactive Job', as described for example at 
> https://edms.cern.ch/file/722398//gLite-3-UserGuide.html#SECTION00084400000000000000 
> 
> 
> 
> Someone asked for the creation of a 'Hold' substate inside the 
> 'Submitted' state, like inside other states.
> 
> This 'Submitted:Hold' substate would make sense only if the Job 
> Submitter could perform an operation on this substate.
> 
> In order to request such an operation, the Job Submitter needs the Jobid 
> (or Job EPR).
> 
> This Jobid (or Job EPR) is guaranteed to be allocated by the Execution 
> Service only at the END of the 'Submitted' state, but NOT before.
> 
> Therefore, I consider that the 'Submitted' state can NOT contain a 
> 'Hold' substate.
> 
> If anyone thinks otherwise, can he/she please present a convincing Use 
> Case ?
> 
> 
> Clarifications about the 'Finished with Success or Error' state
> ---------------------------------------------------------------
> Someone asked that the 'Error' case of the 'Finished with Success or 
> Error' state should be moved to the 'Failed' state.
> 
> In fact, inside the current Job State Model, a Job reaches the 'Finished 
> with Success or Error' state if and only if it successively reached the 
> end of the following states, without failure or cancellation at the JOB 
> level :
> -  'Pre-processing'
> -  'Delegated', whatever the Application result :
>    - Success = Application return code equal     to   zero
>    - Error   = Application return code different from zero
> -  'Post-processing'
> 
> Inside the 'Finished with Success or Error' state :
> -  Success means 'Application return code was equal     to   zero',
> -  Error   means 'Application return code was different from zero'.
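
The rule above can be written as a small Python sketch.  The function 
name and its two parameters are hypothetical, chosen only for 
illustration; the state names are the ones used in this mail.

```python
from typing import Optional


def final_state(job_level_failure: bool, return_code: Optional[int]) -> str:
    """Map the outcome of a Job to its final state.

    job_level_failure: True if the Execution Service detected an
        unrecoverable inconsistency at the JOB level (assumed flag).
    return_code: Application return code, or None if the Application
        never returned one.
    """
    if job_level_failure:
        # The Application return code, if any, does not matter here.
        return "Failed"
    if return_code is None:
        raise ValueError("a Job without JOB-level failure needs a return code")
    outcome = "Success" if return_code == 0 else "Error"
    return f"Finished with Success or Error ({outcome})"
```

For example, final_state(False, 3) gives the 'Error' case of 'Finished 
with Success or Error', whereas final_state(True, 3) gives 'Failed'.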
> 
> I copied this behavior from the Job State Model of 'gLite', where the 
> 'Done' state contains both the 'Success' and 'Exit Code !=0' cases, as 
> can be seen in the 'bookkeeping information' at 
> https://edms.cern.ch/file/722398//gLite-3-UserGuide.html#SECTION00084100000000000000 
> 
> 
> 
> I consider this design, and the strong separation between the 
> 'Failed' and 'Finished with Success or Error' states, to be fully 
> justified for the following reasons :
> 
> -  Whenever a Job reaches the 'Failed' state, the grid Execution Service 
> detected an unrecoverable inconsistency at the JOB level.
>    Therefore, the Job output sandbox and the post-processed Application 
> output files may be inconsistent, and may NOT even be accessible by 
> the Job Submitter.
>    In order to investigate the Job failure, the grid User then needs 
> some grid knowledge (and often experience and expertise) to retrieve and 
> interpret :
>    - the Job failure code and message,
>    - the Job logging and bookkeeping, in comparison with the Job 
> description.
>    This 'grid level' investigation can sometimes prove that the cause of 
> the Job failure came from the Application, but is ALWAYS necessary.
> 
> -  Whenever a Job reaches the 'Finished with Success or Error' state, 
> the grid Execution Service was able to create the Job output sandbox, 
> and to perform post-processing on Application output files, WITHOUT 
> detecting any unrecoverable inconsistency at the JOB level.
>    Therefore, the Job output sandbox, and the post-processed Application 
> output files, can be assumed to be consistent and easily accessible by 
> the Job Submitter.
>    On a non-zero return code of the Application, the grid User :
>    - first has to look (WITHOUT needing any grid knowledge) at the Job 
> output sandbox and at the post-processed Application output files for an 
> Application problem,
>    - before, if necessary, using grid knowledge (and often experience 
> and expertise) to provide any evidence that the Application error was 
> caused by a faulty Job description, the Batch system, or the grid 
> Execution Service.
> 
> As a summary, I consider that the 'Error' case of the 'Finished with 
> Success or Error' state should be kept as it is, and NOT be moved to the 
> 'Failed' state.
> 
> If anyone thinks otherwise, can he/she please present convincing reasons ?
> 
> 
> Strawman Rendering
> ------------------
> I will work on the ODT version of 'Strawman Rendering' at 
> http://forge.gridforum.org/sf/go/doc15628?nav=1 in order to :
> 
> -  include the above clarifications on states,
> 
> -  include the 'Types of grid Jobs' section of my 'PGI Execution Service 
> Overview' document,
> 
> -  check consistency, and present the relationships between the 
> operations described in chapter 2 'Interface: Execution Port-Type' and 
> the different states of the different types of grid Jobs.
> 
> 
> Joining +9900827049931906 (plus perhaps Skype typing) on Friday 14 
> August 2009 at 16h CET.
> 
> Best regards.
> 
> -----------------------------------------------------
> Etienne URBAH         LAL, Univ Paris-Sud, IN2P3/CNRS
>                       Bat 200   91898 ORSAY    France
> Tel: +33 1 64 46 84 87      Skype: etienne.urbah
> Mob: +33 6 22 30 53 27      mailto:urbah at lal.in2p3.fr
> -----------------------------------------------------