[Pgi-wg] OGF PGI - AGU Execution Service Strawman Rendering

Etienne URBAH urbah at lal.in2p3.fr
Mon Oct 26 13:48:15 CDT 2009


Aleksandr, Balazs, Morris, Luigi, Johannes and all,


Concerning the 'AGU Execution Service Strawman Rendering' of OGF PGI :

Aleksandr KONSTANTINOV and I had a telephone discussion on Friday 23 
October at 16h, and we raised the question of whether we want to tie 
state changes tightly to operations, or whether one operation may 
aggregate multiple state changes.


I am NOT an expert on Web Services.  I can imagine 3 ways to implement 
message transfers (between Job Submitter and Execution Service) 
according to the 'Single Job State Model' :



If Execution Service and Job Submitter both implement notifications
-------------------------------------------------------------------
This asynchronous method is the most efficient, but is NOT mandatory.

-  On job submission :
    - The Job Submitter sends a 'CreateActivity' request containing 2 
parameters :
      - The vector of Job descriptions,
      - The URL for notification.
    - The Execution Service immediately sends back a 'CreateActivity' 
response containing Jobids (or error messages).

-  The Job Submitter waits for notifications.

-  Whenever the Job Submitter receives from the Execution Service a 
'Hold' notification (containing for example the location for manual file 
staging) :
    - He/she performs the appropriate work (for example manual file 
staging),
    - Then he/she sends a 'ChangeActivityStatus' request (for example to 
resume Job processing),
    - The Execution Service immediately sends back a 
'ChangeActivityStatus' response describing acceptance or refusal.

-  As soon as the Job is 'Failed' or 'Finished with success or error', 
the Job Submitter receives from the Execution Service the appropriate 
notification.

-  Then, the Job Submitter may send a 'WipeActivity' request to purge 
the Job.
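
As an illustration only, the exchange above could be sketched in Python 
as follows.  This is NOT a rendering of the real messages :  the class 
and method names, the 'resume' request value, the callback standing in 
for the notification URL, and the 'se.example.org' location are all 
assumptions made for this sketch.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Callback standing in for the notification URL: (job_id, state, info).
Notify = Callable[[str, str, dict], None]

TERMINAL_STATES = {"Failed", "Finished with success or error"}


class ExecutionService:
    """In-memory stand-in for the Execution Service (hypothetical)."""

    def __init__(self) -> None:
        self.states: Dict[str, str] = {}        # job_id -> current state
        self.listeners: Dict[str, Notify] = {}  # job_id -> notification target
        self._next_number = 1

    def create_activity(self, descriptions: List[dict], notify: Notify) -> List[str]:
        # Responds immediately with Jobids; state changes are pushed later.
        job_ids = []
        for _ in descriptions:
            job_id = f"job-{self._next_number}"
            self._next_number += 1
            self.states[job_id] = "Submitted"
            self.listeners[job_id] = notify
            job_ids.append(job_id)
        return job_ids

    def advance(self, job_id: str, state: str, info: Optional[dict] = None) -> None:
        # Simulates an internal state change and notifies the submitter.
        self.states[job_id] = state
        self.listeners[job_id](job_id, state, info or {})

    def change_activity_status(self, job_id: str, request: str) -> bool:
        # True means accepted, False means refused.
        if request == "resume" and self.states[job_id].endswith("Hold"):
            self.states[job_id] = "Pre-processing"
            return True
        return False

    def wipe_activity(self, job_id: str) -> None:
        del self.states[job_id]
        del self.listeners[job_id]


class JobSubmitter:
    def __init__(self, service: ExecutionService) -> None:
        self.service = service
        self.log: List[Tuple[str, str]] = []

    def on_notification(self, job_id: str, state: str, info: dict) -> None:
        self.log.append((job_id, state))
        if state.endswith("Hold"):
            # Manual file staging to info["staging_location"] would go here.
            accepted = self.service.change_activity_status(job_id, "resume")
            self.log.append((job_id, "resume accepted" if accepted else "resume refused"))
        elif state in TERMINAL_STATES:
            self.service.wipe_activity(job_id)


service = ExecutionService()
submitter = JobSubmitter(service)
[job_id] = service.create_activity([{"executable": "/bin/true"}],
                                   submitter.on_notification)
service.advance(job_id, "Pre-processing:Hold",
                {"staging_location": "gsiftp://se.example.org/in/job-1"})
service.advance(job_id, "Finished with success or error")
print(submitter.log)
```

In this sketch the Job Submitter never asks for the Job status :  every 
action is triggered by a notification.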



If Execution Service or Job Submitter does NOT implement notifications
----------------------------------------------------------------------
Then the Job Submitter has to poll the Job status.

-  On job submission :
    - The Job Submitter sends a 'CreateActivity' request containing only 
1 parameter :  The vector of Job descriptions,
    - The Execution Service immediately sends back a 'CreateActivity' 
response containing Jobids (or error messages).

-  From time to time :
    - The Job Submitter sends a 'GetActivityStatus' request,
    - The Execution Service immediately sends back a 'GetActivityStatus' 
response describing the Job status and appropriate additional 
information (for example the location for manual file staging).

-  Whenever necessary (for example the Job status has just become 'Hold') :
    - The Job Submitter performs the appropriate work (for example 
manual file staging),
    - Then he/she sends a 'ChangeActivityStatus' request (for example to 
resume Job processing),
    - The Execution Service immediately sends back a 
'ChangeActivityStatus' response describing acceptance or refusal.

-  When the Job status has become 'Failed' or 'Finished with success or 
error', the Job Submitter may send a 'WipeActivity' request to purge the 
Job.

This method provides consistency with the 'Single Job State Model', but 
requires repetitive 'GetActivityStatus' requests.
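
As an illustration only, the polling loop could be sketched in Python as 
follows.  The 'ScriptedService' class, the 'resume' request value and 
the polling interval are assumptions made for this sketch, and the 
scripted list of states is assumed to end with a terminal state.

```python
import time
from typing import Dict, List, Tuple

TERMINAL_STATES = {"Failed", "Finished with success or error"}


class ScriptedService:
    """Hypothetical stand-in: every job walks through the same list of states."""

    def __init__(self, script: List[str]) -> None:
        self.script = script
        self.position: Dict[str, int] = {}  # job_id -> index into script
        self.status_requests = 0            # counts 'GetActivityStatus' requests

    def create_activity(self, descriptions: List[dict]) -> List[str]:
        job_ids = [f"job-{i + 1}" for i in range(len(descriptions))]
        for job_id in job_ids:
            self.position[job_id] = 0
        return job_ids

    def get_activity_status(self, job_id: str) -> Tuple[str, dict]:
        self.status_requests += 1
        index = self.position[job_id]
        state = self.script[index]
        # A held job stays held until resumed; the last state is final.
        if not state.endswith("Hold") and index < len(self.script) - 1:
            self.position[job_id] = index + 1
        return state, {}

    def change_activity_status(self, job_id: str, request: str) -> bool:
        index = self.position[job_id]
        if request == "resume" and self.script[index].endswith("Hold"):
            self.position[job_id] = index + 1
            return True
        return False

    def wipe_activity(self, job_id: str) -> None:
        del self.position[job_id]


def run_by_polling(service: ScriptedService, description: dict,
                   interval: float = 0.0) -> List[str]:
    [job_id] = service.create_activity([description])
    seen = []
    while True:
        state, _info = service.get_activity_status(job_id)
        seen.append(state)
        if state.endswith("Hold"):
            # Manual file staging would go here, then processing is resumed.
            service.change_activity_status(job_id, "resume")
        elif state in TERMINAL_STATES:
            service.wipe_activity(job_id)
            return seen
        time.sleep(interval)


service = ScriptedService(["Submitted", "Pre-processing:Hold", "Delegated",
                           "Finished with success or error"])
seen = run_by_polling(service, {"executable": "/bin/true"})
print(seen, service.status_requests)
```

Each state is learnt through one 'GetActivityStatus' request, which 
shows both the consistency with the state model and the cost in 
repeated requests.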



Method minimizing 'GetActivityStatus' requests without notifications
--------------------------------------------------------------------
As far as I have understood from Aleksandr's explanations :

-  On job submission, the Job Submitter sends a 'CreateActivity' request 
containing only 1 parameter :  The vector of Job descriptions.

-  The Execution Service sends back a 'CreateActivity' response 
containing, for each Job :
    - Its Jobid (or error message),
    - If necessary, the location for file stage-in.

- If manual file stage-in is necessary :
    - The Job Submitter :
      - performs the manual file stage-in,
      - sends a 'ChangeActivityStatus' request (for example to resume 
Job processing).
    - The Execution Service sends back a 'ChangeActivityStatus' response 
describing acceptance or refusal.

-  From time to time :
    - The Job Submitter sends a 'GetActivityStatus' request,
    - The Execution Service immediately sends back a 'GetActivityStatus' 
response describing the Job status and appropriate additional 
information (for example the location for manual file stage-out).

-  Whenever necessary (for example the Job status has just become 
'Post-processing:Hold:Manual-Stage-Out') :
    - The Job Submitter performs the appropriate work (for example 
manual file stage-out),
    - Then he/she sends a 'ChangeActivityStatus' request (for example to 
resume Job processing),
    - The Execution Service sends back a 'ChangeActivityStatus' response 
describing acceptance or refusal (for example the Job status has become 
'Failed' or 'Finished with success or error').

-  When the Job status has become 'Failed' or 'Finished with success or 
error', the Job Submitter may send a 'WipeActivity' request to purge the 
Job.


This method minimizes 'GetActivityStatus' requests, but :

-  The time between the 'CreateActivity' request and the 
'CreateActivity' response (containing the location for file stage-in) 
can be very long (for example if the Job must stay a long time in the 
'Submitted' state waiting for computing and/or storage resources).

-  Repetitive 'GetActivityStatus' requests are still necessary for the 
Job Submitter to learn that a Job has reached the 
'Post-processing:Hold:Manual-Stage-Out' state (or the 'Finished with 
success or error' state if no manual stage-out is necessary).

So, I can NOT guarantee the consistency of this method with the 'Single 
Job State Model'.
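
As an illustration only, the aggregated 'CreateActivity' response of 
this third method could be sketched in Python as follows.  The type and 
function names, the 'manual_stage_in' key and the 'se.example.org' 
location are assumptions made for this sketch.  It only shows that one 
operation then hides more than one state change.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CreateActivityResult:
    job_id: str
    stage_in_location: Optional[str]  # None when no manual stage-in is needed
    states_traversed: List[str]       # state changes hidden inside the operation


def create_activity_aggregated(description: dict, number: int) -> CreateActivityResult:
    job_id = f"job-{number}"
    traversed = ["Submitted"]
    location = None
    if description.get("manual_stage_in"):
        # The response can only be sent once storage has been allocated,
        # that is after the Job has left the 'Submitted' state.
        traversed.append("Pre-processing:Hold")
        location = f"gsiftp://se.example.org/in/{job_id}"
    return CreateActivityResult(job_id, location, traversed)


held = create_activity_aggregated({"manual_stage_in": True}, 1)
plain = create_activity_aggregated({}, 2)
print(held)
print(plain)
```

For the first Job, the single response covers both the 'Submitted' and 
'Pre-processing:Hold' states, which is the aggregation questioned above.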


Please study the above 3 methods carefully, make up your mind, and send 
comments or remarks, so that we can together improve the design of the 
messages, and achieve consensus.


Besides, I will probably NOT be able to attend the PGI telephone 
conferences on 30 October and 06 November 2009.


Best regards.

-----------------------------------------------------
Etienne URBAH         LAL, Univ Paris-Sud, IN2P3/CNRS
                       Bat 200   91898 ORSAY    France
Tel: +33 1 64 46 84 87      Skype: etienne.urbah
Mob: +33 6 22 30 53 27      mailto:urbah at lal.in2p3.fr
-----------------------------------------------------


On Fri, 16 Oct 2009, Etienne URBAH wrote:
> Balazs, Morris, Luigi, Johannes and all,
> 
> Concerning the 'AGU Execution Service Strawman Rendering' of OGF PGI and 
> the telephone conference of last week on 09 October 2009 :
> 
> -  Many thanks to Morris for having given detailed explanations on 
> chapter 2.1 'CreateActivity Operation'.
>    I now much better understand what is described inside an 'operation'.
> 
> -  Many thanks to Johannes for the Report and for the Action list.
> 
> 
> Consistency between the CreateActivity operation and the State Model
> --------------------------------------------------------------------
> Inside chapter 2.1 'CreateActivity operation', I found discrepancies 
> between the current description of the 'CreateActivity' operation and 
> the PGI Single Job State Model :
> 
> -  Inside the PGI Single Job State Model, the Execution Service :
>    - Allocates a Jobid (or an EPR) to the Job and sends it back to the 
> Submitter at the end of the 'Submitted' state, BEFORE any storage 
> allocation could be performed,
>    - Notifies the submitter with allocated storage resources for 
> stage-in only inside the 'Pre-processing:Hold' state.
> 
> -  The current description of the 'CreateActivity' operation encompasses 
> both the 'Submitted' and 'Pre-processing' states, and describes that the 
> response can contain information about storage resources for stage-in.
>    In fact :
>    - The 'CreateActivity' operation should be limited to the 'Submitted' 
> state, and the response can only be a vector of Jobids (or EPRs).  
> Information about storage resources for stage-in can only be given 
> later, through a 'GetActivityInfo' request or a notification to the 
> submitter.
>    - In order to permit notification, the 'CreateActivity' operation 
> should allow a 'Notification EPR' as an additional optional input 
> parameter.
> 
> I have updated the document with changes highlighted at 
> http://forge.gridforum.org/sf/go/doc15628?nav=1
> 
> 
> Hold substate inside the 'Submitted' state ?
> --------------------------------------------
> See mail below.
> 
> 
> Best regards.
> 
> -----------------------------------------------------
> Etienne URBAH         LAL, Univ Paris-Sud, IN2P3/CNRS
>                       Bat 200   91898 ORSAY    France
> Tel: +33 1 64 46 84 87      Skype: etienne.urbah
> Mob: +33 6 22 30 53 27      mailto:urbah at lal.in2p3.fr
> -----------------------------------------------------
> 
> 
> On Thu, 13 Aug 2009, Etienne URBAH wrote:
>> Balazs, Morris and all,
>>
>>
>> Concerning the last OGF PGI telephone conference on 05 August 2009 :
>>
>>
>> Meeting notes
>> -------------
>> I see NO meeting notes about this telephone conference at 
>> http://forge.gridforum.org/sf/discussion/do/listTopics/projects.pgi-wg/discussion.meetings 
>>
>>
>> So I am working with my own (fragmentary) notes.
>>
>> For all future OGF PGI telephone conferences, is it possible that a 
>> secretary or a chair takes meeting notes, then writes them down in an 
>> understandable form, and publishes them at the above-mentioned page ?
>>
>>
>> Creation of a 'Submitted:Hold' substate ?
>> -----------------------------------------
>> First, as general rules, I consider that :
>>
>> -  In order to AVOID keeping (potentially large) grid resources while 
>> NOT computing, grid Jobs should be designed to be processed completely 
>> automatically, with NO provision for 'Hold' substates,
>>
>> -  A grid Job needing many 'Hold' substates can NOT be handled by an 
>> automatic Submitter, but should be submitted by a human grid User as 
>> an 'Interactive Job', as described for example at 
>> https://edms.cern.ch/file/722398//gLite-3-UserGuide.html#SECTION00084400000000000000 
>>
>>
>>
>> Someone asked for the creation of a 'Hold' substate inside the 
>> 'Submitted' state, like inside other states.
>>
>> This 'Submitted:Hold' substate would make sense only if the Job 
>> Submitter could perform an operation on this substate.
>>
>> In order to request such an operation, the Job Submitter needs the 
>> Jobid (or Job EPR).
>>
>> This Jobid (or Job EPR) is guaranteed to be allocated by the Execution 
>> Service only at the END of the 'Submitted' state, but NOT before.
>>
>> Therefore, I consider that the 'Submitted' state can NOT contain a 
>> 'Hold' substate.
>>
>> If anyone thinks otherwise, can he/she please present a convincing Use 
>> Case ?
>>
>>
>> Clarifications about the 'Finished with Success or Error' state
>> ---------------------------------------------------------------
>> Someone asked that the 'Error' case of the 'Finished with Success or 
>> Error' state should be moved to the 'Failed' state.
>>
>> In fact, inside the current Job State Model, a Job reaches the 
>> 'Finished with Success or Error' state if and only if it successively 
>> reached the end of the following states, without failure or 
>> cancellation at the JOB level :
>> -  'Pre-processing'
>> -  'Delegated', whatever the Application result :
>>    - Success = Application return code equal to zero
>>    - Error   = Application return code different from zero
>> -  'Post-processing'
>>
>> Inside the 'Finished with Success or Error' state :
>> -  Success means 'Application return code was equal to zero',
>> -  Error   means 'Application return code was different from zero'.
>>
>> I copied this behavior from the Job State Model of 'gLite', where the 
>> 'Done' state contains both the 'Success' and 'Exit Code !=0' cases, as 
>> can be seen in the 'bookkeeping information' at 
>> https://edms.cern.ch/file/722398//gLite-3-UserGuide.html#SECTION00084100000000000000 
>>
>>
>>
>> I consider this design, and the strong separation between the 
>> 'Failed' and 'Finished with Success or Error' states, as fully 
>> justified by the following reasons :
>>
>> -  Whenever a Job reaches the 'Failed' state, the grid Execution 
>> Service detected an unrecoverable inconsistency at the JOB level.
>>    Therefore, the Job output sandbox and the post-processed 
>> Application output files may be NOT consistent, and may NOT even be 
>> accessible by the Job Submitter.
>>    In order to investigate the Job failure, the grid User then needs 
>> some grid knowledge (and often experience and expertise) to retrieve 
>> and interpret :
>>    - the Job failure code and message,
>>    - the Job logging and bookkeeping, in comparison with the Job 
>> description.
>>    This 'grid level' investigation can sometimes prove that the cause 
>> of the Job failure came from the Application, but is ALWAYS necessary.
>>
>> -  Whenever a Job reaches the 'Finished with Success or Error' state, 
>> the grid Execution Service could create the Job output sandbox, and 
>> perform post-processing on Application output files, WITHOUT detecting 
>> any unrecoverable inconsistency at the JOB level.
>>    Therefore, the Job output sandbox, and the post-processed 
>> Application output files, can be supposed to be consistent and easily 
>> accessible by the Job Submitter.
>>    On a non-zero return code of the Application, the grid User :
>>    - first has to look (WITHOUT needing any grid knowledge) at the Job 
>> output sandbox and at the post-processed Application output files for 
>> an Application problem,
>>    - before, if necessary, using grid knowledge (and often experience 
>> and expertise) to provide any evidence that the Application error was 
>> caused by a faulty Job description, the Batch system, or the grid 
>> Execution Service.
>>
>> As a summary, I consider that the 'Error' case of the 'Finished with 
>> Success or Error' state should be kept as it is, and NOT be moved to 
>> the 'Failed' state.
>>
>> If anyone thinks otherwise, can he/she please present convincing 
>> reasons ?
>>
>>
>> Strawman Rendering
>> ------------------
>> I will work on the ODT version of 'Strawman Rendering' at 
>> http://forge.gridforum.org/sf/go/doc15628?nav=1 in order to :
>>
>> -  include the above clarifications on states,
>>
>> -  include the 'Types of grid Jobs' section of my 'PGI Execution 
>> Service Overview' document,
>>
>> -  check consistency, and present the relationships between the 
>> operations described in chapter 2 'Interface: Execution Port-Type' and 
>> the different states of the different types of grid Jobs.
>>
>>
>> Joining +9900827049931906 (plus perhaps Skype typing) on Friday 14 
>> August 2009 at 16h CET.
>>
>> Best regards.
>>
>> -----------------------------------------------------
>> Etienne URBAH         LAL, Univ Paris-Sud, IN2P3/CNRS
>>                       Bat 200   91898 ORSAY    France
>> Tel: +33 1 64 46 84 87      Skype: etienne.urbah
>> Mob: +33 6 22 30 53 27      mailto:urbah at lal.in2p3.fr
>> -----------------------------------------------------