[Pgi-wg] OGF PGI - AGU Execution Service Strawman Rendering

Fri Dec 4 09:51:25 CST 2009

On Mon, 26 Oct 2009, Etienne URBAH wrote:
> Aleksandr, Balazs, Morris, Luigi, Johannes and all,
> 
> 
> Concerning the 'AGU Execution Service Strawman Rendering' of OGF PGI :
> 
> Aleksandr KONSTANTINOV and I had a telephone talk on Friday 23 
> October at 16h, and we raised the question of whether we want to tie 
> state changes tightly to operations, or whether an operation may 
> aggregate multiple state changes.
> 
> 
> I am NOT an expert on Web Services.  I can imagine 3 ways to implement 
> message transfers (between Job Submitter and Execution Service) 
> according to the 'Single Job State Model' :
> 
> 
> 
> If Execution Service and Job Submitter both implement notifications
> -------------------------------------------------------------------
> This asynchronous method is the most efficient, but is NOT mandatory.
> 
> -  On job submission :
>    - The Job Submitter sends a 'CreateActivity' request containing 2 
> parameters :
>      - The vector of Job descriptions,
>      - The URL for notification.
>    - The Execution Service immediately sends back a 'CreateActivity' 
> response containing Jobids (or error messages).
> 
> -  The Job Submitter waits for notifications.
> 
> -  Whenever the Job Submitter receives from the Execution Service a 
> 'Hold' notification (containing for example the location for manual file 
> staging) :
>    - He/she performs the appropriate work (for example manual file 
> staging),
>    - Then he/she sends a 'ChangeActivityStatus' request (for example to 
> resume Job processing),
>    - The Execution Service immediately sends back a 
> 'ChangeActivityStatus' response describing acceptance or refusal.
> 
> -  As soon as the Job is 'Failed' or 'Finished with success or error', 
> the Job Submitter receives from the Execution Service the appropriate 
> notification.
> 
> -  Then, the Job Submitter may send a 'WipeActivity' request to purge 
> the Job.
> 
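The notification-based exchange above can be sketched as follows. This is only a minimal in-memory illustration, not a proposed rendering: the class `ExecutionService`, the `advance` helper, the `inbox` queue standing in for the notification URL, and the short state name "Finished" (for 'Finished with success or error') are all assumptions made for the example.

```python
from collections import deque
from typing import Callable, Dict, List

# Hypothetical stand-in for a notification endpoint: (job_id, state, info).
Notify = Callable[[str, str, dict], None]


class ExecutionService:
    """Toy in-memory Execution Service (assumption, not a real API)."""

    def __init__(self) -> None:
        self._state: Dict[str, str] = {}
        self._notify: Dict[str, Notify] = {}

    def create_activity(self, descriptions: List[dict], notify: Notify) -> List[str]:
        # Returns the Jobids immediately; nothing else is known yet.
        job_ids = []
        for _ in descriptions:
            job_id = f"job-{len(self._state) + 1}"
            self._state[job_id] = "Submitted"
            self._notify[job_id] = notify
            job_ids.append(job_id)
        return job_ids

    def advance(self, job_id: str) -> None:
        # Simulates the service reaching a 'Hold' that needs manual staging.
        self._state[job_id] = "Hold"
        self._notify[job_id](job_id, "Hold", {"staging_location": f"/staging/{job_id}"})

    def change_activity_status(self, job_id: str, action: str) -> bool:
        # Accepts a resume only while the Job is held; otherwise refuses.
        if self._state.get(job_id) != "Hold" or action != "resume":
            return False
        self._state[job_id] = "Finished"
        self._notify[job_id](job_id, "Finished", {})
        return True

    def wipe_activity(self, job_id: str) -> None:
        self._state.pop(job_id, None)
        self._notify.pop(job_id, None)


def run_notification_flow() -> List[str]:
    service = ExecutionService()
    inbox: deque = deque()  # notifications received by the Job Submitter
    log: List[str] = []
    (job_id,) = service.create_activity(
        [{"executable": "/bin/true"}], lambda *n: inbox.append(n)
    )
    service.advance(job_id)
    while inbox:  # the Job Submitter waits for notifications
        jid, state, _info = inbox.popleft()
        log.append(state)
        if state == "Hold":
            # Manual file staging would happen here, then the Job is resumed.
            accepted = service.change_activity_status(jid, "resume")
            log.append("accepted" if accepted else "refused")
        elif state == "Finished":
            service.wipe_activity(jid)
            log.append("wiped")
    return log
```

Note that the Job Submitter never asks for the Job status here: every step it takes is triggered by a received notification.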
> 
> 
> If Execution Service or Job Submitter does NOT implement notifications
> ----------------------------------------------------------------------
> Then the Job Submitter has to poll the Job status.
> 
> -  On job submission :
>    - The Job Submitter sends a 'CreateActivity' request containing only 
> 1 parameter :  The vector of Job descriptions,
>    - The Execution Service immediately sends back a 'CreateActivity' 
> response containing Jobids (or error messages).
> 
> -  From time to time :
>    - The Job Submitter sends a 'GetActivityStatus' request,
>    - The Execution Service immediately sends back a 'GetActivityStatus' 
> response describing the Job status and appropriate additional 
> information (for example the location for manual file staging).
> 
> -  Whenever necessary (for example the Job status has just become 'Hold') :
>    - The Job Submitter performs the appropriate work (for example manual 
> file staging),
>    - Then he/she sends a 'ChangeActivityStatus' request (for example to 
> resume Job processing),
>    - The Execution Service immediately sends back a 
> 'ChangeActivityStatus' response describing acceptance or refusal.
> 
> -  When the Job status has become 'Failed' or 'Finished with success or 
> error', the Job Submitter may send a 'WipeActivity' request to purge the 
> Job.
> 
> This method provides consistency with the 'Single Job State Model', but 
> requires repetitive 'GetActivityStatus' requests.
> 
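The polling variant above amounts to a loop like the following. This is a rough sketch under assumptions: `get_status` and `on_hold` are hypothetical callables standing in for the 'GetActivityStatus' request and for the manual work plus 'ChangeActivityStatus' request, and "Finished" abbreviates 'Finished with success or error'.

```python
import time
from typing import Callable, Tuple

# Terminal states of the sketch; "Finished" is an abbreviation (assumption).
TERMINAL = {"Failed", "Finished"}


def poll_until_terminal(
    get_status: Callable[[], Tuple[str, dict]],
    on_hold: Callable[[dict], None],
    interval_s: float = 0.0,
    max_polls: int = 100,
) -> Tuple[str, int]:
    """Poll the Job status until it is terminal.

    Returns the final state and the number of status requests sent,
    which is the cost this method pays for not using notifications.
    """
    for polls in range(1, max_polls + 1):
        state, info = get_status()
        if state in TERMINAL:
            return state, polls
        if state == "Hold":
            # Manual work (e.g. file staging) and the resume request go here.
            on_hold(info)
        time.sleep(interval_s)
    raise TimeoutError(f"Job not terminal after {max_polls} polls")
```

A scripted sequence of statuses is enough to exercise it:

```python
script = iter([
    ("Submitted", {}),
    ("Hold", {"staging_location": "/staging/job-1"}),
    ("Delegated", {}),
    ("Finished", {}),
])
held = []
result = poll_until_terminal(lambda: next(script), lambda info: held.append(info))
```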
> 
> 
> Method minimizing 'GetActivityStatus' requests without notifications
> --------------------------------------------------------------------
> As far as I have understood from Aleksandr's explanations :
> 
> -  On job submission, the Job Submitter sends a 'CreateActivity' request 
> containing only 1 parameter :  The vector of Job descriptions.
> 
> -  The Execution Service sends back a 'CreateActivity' response 
> containing, for each Job :
>    - Its Jobid (or error message),
>    - If necessary, the location for file stage-in.
> 
> - If manual file stage-in is necessary :
>    - The Job Submitter :
>      - performs the manual file stage-in,
>      - sends a 'ChangeActivityStatus' request (for example to resume Job 
> processing).
>    - The Execution Service sends back a 'ChangeActivityStatus' response 
> describing acceptance or refusal.
> 
> -  From time to time :
>    - The Job Submitter sends a 'GetActivityStatus' request,
>    - The Execution Service immediately sends back a 'GetActivityStatus' 
> response describing the Job status and appropriate additional 
> information (for example the location for manual file stage-out).
> 
> -  Whenever necessary (for example the Job status has just become 
> 'Post-processing:Hold:Manual-Stage-Out') :
>    - The Job Submitter performs the appropriate work (for example manual 
> file stage-out),
>    - Then he/she sends a 'ChangeActivityStatus' request (for example to 
> resume Job processing),
>    - The Execution Service sends back a 'ChangeActivityStatus' response 
> describing acceptance or refusal (for example the Job status has become 
> 'Failed' or 'Finished with success or error').
> 
> -  When the Job status has become 'Failed' or 'Finished with success or 
> error', the Job Submitter may send a 'WipeActivity' request to purge the 
> Job.
> 
> 
> This method minimizes 'GetActivityStatus' requests, but :
> 
> -  The time between the 'CreateActivity' request and the 
> 'CreateActivity' response (containing the location for file stage-in) 
> can be very long (for example if the Job must stay a long time in the 
> 'Submitted' state waiting for computing and/or storage resources).
> 
> -  Repetitive 'GetActivityStatus' requests are still necessary for the 
> Job Submitter to learn that a Job has reached the 
> 'Post-processing:Hold:Manual-Stage-Out' state (or the 'Finished with 
> success or error' state if no manual stage-out is necessary).
> 
> So, I can NOT guarantee the consistency of this method with the 'Single 
> Job State Model'.
> 
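To make the third method concrete, here is a small sketch of what the Job Submitter would do with each per-Job entry of such a 'CreateActivity' response. The type `CreateActivityResult`, its field names, and the textual step labels are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CreateActivityResult:
    """One per-Job entry of the response (hypothetical field names)."""
    job_id: Optional[str] = None
    error: Optional[str] = None
    stage_in_location: Optional[str] = None  # present only if stage-in is needed


def next_steps(result: CreateActivityResult) -> List[str]:
    """List what the Job Submitter does after receiving one entry."""
    if result.error is not None:
        return [f"report error: {result.error}"]
    steps: List[str] = []
    if result.stage_in_location is not None:
        # The response already carries the location, so no status request
        # is needed before the manual stage-in.
        steps.append(f"stage in to {result.stage_in_location}")
        steps.append("ChangeActivityStatus(resume)")
    # Polling is still required to learn about stage-out or completion.
    steps.append("poll GetActivityStatus")
    return steps
```

The final polling step appears in every successful case, which is the second drawback listed above.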
> 
> Please study the above 3 methods carefully, make up your mind, and send 
> comments or remarks, so that we can together improve the design of the 
> messages, and achieve consensus.
> 
> 
> Besides, I will probably NOT be able to attend the PGI telephone 
> conferences on 30 October and 06 November 2009.
> 
> 
> Best regards.
> 
> -----------------------------------------------------
> Etienne URBAH         LAL, Univ Paris-Sud, IN2P3/CNRS
>                       Bat 200   91898 ORSAY    France
> Tel: +33 1 64 46 84 87      Skype: etienne.urbah
> Mob: +33 6 22 30 53 27      mailto:urbah at lal.in2p3.fr
> -----------------------------------------------------
> 
> 
> On Fri, 16 Oct 2009, Etienne URBAH wrote:
>> Balazs, Morris, Luigi, Johannes and all,
>>
>> Concerning the 'AGU Execution Service Strawman Rendering' of OGF PGI 
>> and the telephone conference of last week on 09 October 2009 :
>>
>> -  Many thanks to Morris for having given detailed explanations on 
>> chapter 2.1 'CreateActivity Operation'.
>>    I now understand much better what is described inside an 'operation'.
>>
>> -  Many thanks to Johannes for the Report and for the Action list.
>>
>>
>> Consistency between the CreateActivity operation and the State Model
>> --------------------------------------------------------------------
>> Inside chapter 2.1 'CreateActivity operation', I found discrepancies 
>> between the current description of the 'CreateActivity' operation and 
>> the PGI Single Job State Model :
>>
>> -  Inside the PGI Single Job State Model, the Execution Service :
>>    - Allocates a Jobid (or an EPR) to the Job and sends it back to the 
>> Submitter at the end of the 'Submitted' state, BEFORE any storage 
>> allocation could be performed,
>>    - Notifies the submitter with allocated storage resources for 
>> stage-in only inside the 'Pre-processing:Hold' state.
>>
>> -  The current description of the 'CreateActivity' operation encompasses 
>> both the 'Submitted' and 'Pre-processing' states, and describes that 
>> the response can contain information about storage resources for 
>> stage-in.
>>    In fact :
>>    - The 'CreateActivity' operation should be limited to the 
>> 'Submitted' state, and the response can only be a vector of 
>> Jobids (or EPRs).  Information about storage resources for stage-in 
>> can only be given later, through a 'GetActivityInfo' request or a 
>> notification to the submitter.
>>    - In order to permit notification, the 'CreateActivity' operation 
>> should allow a 'Notification EPR' as an additional optional input 
>> parameter.
>>
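The restricted 'CreateActivity' operation proposed above could look roughly like this. It is a sketch under assumptions, not the Strawman Rendering itself: the names `CreateActivityRequest`, `notification_epr`, the dictionary-shaped results, and the trivial validation rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CreateActivityRequest:
    job_descriptions: List[dict]
    # Optional input parameter; when absent, the Submitter has to poll.
    notification_epr: Optional[str] = None


def create_activity(request: CreateActivityRequest) -> List[Dict[str, str]]:
    """Return one entry per Job description: a Jobid or an error message.

    The response deliberately carries no storage information: the operation
    covers only the 'Submitted' state, so stage-in locations would arrive
    later through 'GetActivityInfo' or a notification.
    """
    results: List[Dict[str, str]] = []
    for index, description in enumerate(request.job_descriptions):
        if "executable" not in description:  # toy validation (assumption)
            results.append({"error": "missing 'executable'"})
        else:
            results.append({"job_id": f"job-{index}"})
    return results
```

For example, a request with one valid and one empty description yields one Jobid and one error message, in the same order as the input vector.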
>> I have updated the document with changes highlighted at 
>> http://forge.gridforum.org/sf/go/doc15628?nav=1
>>
>>
>> Hold substate inside the 'Submitted' state ?
>> --------------------------------------------
>> See mail below.
>>
>>
>> Best regards.
>>
>>
>> On Thu, 13 Aug 2009, Etienne URBAH wrote:
>>> Balazs, Morris and all,
>>>
>>>
>>> Concerning the last OGF PGI telephone conference on 05 August 2009 :
>>>
>>>
>>> Meeting notes
>>> -------------
>>> I see NO meeting notes about this telephone conference at 
>>> http://forge.gridforum.org/sf/discussion/do/listTopics/projects.pgi-wg/discussion.meetings 
>>>
>>>
>>> So I am working with my own (fragmentary) notes.
>>>
>>> For all future OGF PGI telephone conferences, is it possible that a 
>>> secretary or a chair takes meeting notes, then writes them down in an 
>>> understandable form, and publishes them on the above-mentioned page ?
>>>
>>>
>>> Creation of a 'Submitted:Hold' substate ?
>>> -----------------------------------------
>>> First, as general rules, I consider that :
>>>
>>> -  In order to AVOID keeping (potentially large) grid resources while 
>>> NOT computing, grid Jobs should be designed to be processed 
>>> completely automatically, with NO provision for 'Hold' substates,
>>>
>>> -  A grid Job needing many 'Hold' substates can NOT be handled by an 
>>> automatic Submitter, but should be submitted by a human grid User as 
>>> an 'Interactive Job', as described for example at 
>>> https://edms.cern.ch/file/722398//gLite-3-UserGuide.html#SECTION00084400000000000000 
>>>
>>>
>>>
>>> Someone asked for the creation of a 'Hold' substate inside the 
>>> 'Submitted' state, like inside other states.
>>>
>>> This 'Submitted:Hold' substate would make sense only if the Job 
>>> Submitter could perform an operation on this substate.
>>>
>>> In order to request such an operation, the Job Submitter needs the 
>>> Jobid (or Job EPR).
>>>
>>> This Jobid (or Job EPR) is guaranteed to be allocated by the 
>>> Execution Service only at the END of the 'Submitted' state, but NOT 
>>> before.
>>>
>>> Therefore, I consider that the 'Submitted' state can NOT contain a 
>>> 'Hold' substate.
>>>
>>> If anyone thinks otherwise, can he/she please present a convincing 
>>> Use Case ?
>>>
>>>
>>> Clarification of the 'Finished with Success or Error' state
>>> -----------------------------------------------------------
>>> Someone asked that the 'Error' case of the 'Finished with Success or 
>>> Error' state should be moved to the 'Failed' state.
>>>
>>> In fact, inside the current Job State Model, a Job reaches the 
>>> 'Finished with Success or Error' state if and only if it successively 
>>> reached the end of the following states, without failure or 
>>> cancellation at the JOB level :
>>> -  'Pre-processing'
>>> -  'Delegated', whatever the Application result :
>>>    - Success = Application return code equal to zero
>>>    - Error   = Application return code different from zero
>>> -  'Post-processing'
>>>
>>> Inside the 'Finished with Success or Error' state :
>>> -  Success means 'Application return code was equal to zero',
>>> -  Error   means 'Application return code was different from zero'.
>>>
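The distinction above can be written as a small decision function. This is only an illustrative sketch: the function name, the `job_level_failure` flag, and the "Finished:Success" / "Finished:Error" labels are assumptions, not names from the State Model document.

```python
from typing import Optional


def final_state(job_level_failure: bool, return_code: Optional[int]) -> str:
    """Map the outcome of a Job to its final state.

    job_level_failure: a failure or cancellation was detected at the JOB
        level in 'Pre-processing', 'Delegated' or 'Post-processing'.
    return_code: the Application return code, if the Application ran.
    """
    if job_level_failure:
        # A JOB-level failure dominates, whatever the Application returned.
        return "Failed"
    if return_code is None:
        raise ValueError("a Job without JOB-level failure must have a return code")
    # Both outcomes stay inside 'Finished with Success or Error'.
    return "Finished:Success" if return_code == 0 else "Finished:Error"
```

A non-zero return code alone therefore never leads to 'Failed'; only the job-level flag does.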
>>> I copied this behavior from the Job State Model of 'gLite', where the 
>>> 'Done' state contains both the 'Success' and 'Exit Code !=0' cases, 
>>> as can be seen in the 'bookkeeping information' at 
>>> https://edms.cern.ch/file/722398//gLite-3-UserGuide.html#SECTION00084100000000000000 
>>>
>>>
>>>
>>> I consider this design, and the strong separation between 
>>> the 'Failed' and 'Finished with Success or Error' states, as fully 
>>> justified by the following reasons :
>>>
>>> -  Whenever a Job reaches the 'Failed' state, the grid Execution 
>>> Service detected an unrecoverable inconsistency at the JOB level.
>>>    Therefore, the Job output sandbox and the post-processed 
>>> Application output files can potentially be NOT consistent and NOT 
>>> even accessible by the Job Submitter.
>>>    In order to investigate the Job failure, the grid User then needs 
>>> some grid knowledge (and often experience and expertise) to retrieve 
>>> and interpret :
>>>    - the Job failure code and message,
>>>    - the Job logging and bookkeeping, in comparison with the Job 
>>> description.
>>>    This 'grid level' investigation can sometimes prove that the cause 
>>> of the Job failure came from the Application, but is ALWAYS necessary.
>>>
>>> -  Whenever a Job reaches the 'Finished with Success or Error' state, 
>>> the grid Execution Service could create the Job output sandbox, and 
>>> perform post-processing on Application output files, WITHOUT 
>>> detecting any unrecoverable inconsistency at the JOB level.
>>>    Therefore, the Job output sandbox, and the post-processed 
>>> Application output files, can be supposed to be consistent and easily 
>>> accessible by the Job Submitter.
>>>    On a non-zero return code of the Application, the grid User :
>>>    - first has to look (WITHOUT needing any grid knowledge) at the 
>>> Job output sandbox and at the post-processed Application output files 
>>> for an Application problem,
>>>    - before, if necessary, using grid knowledge (and often experience 
>>> and expertise) to provide any evidence that the Application error was 
>>> caused by a faulty Job description, the Batch system, or the grid 
>>> Execution Service.
>>>
>>> As a summary, I consider that the 'Error' case of the 'Finished with 
>>> Success or Error' state should be kept as it is, and NOT be moved to 
>>> the 'Failed' state.
>>>
>>> If anyone thinks otherwise, can he/she please present convincing 
>>> reasons ?
>>>
>>>
>>> Strawman Rendering
>>> ------------------
>>> I will work on the ODT version of 'Strawman Rendering' at 
>>> http://forge.gridforum.org/sf/go/doc15628?nav=1 in order to :
>>>
>>> -  include the above clarifications on states,
>>>
>>> -  include the 'Types of grid Jobs' section of my 'PGI Execution 
>>> Service Overview' document,
>>>
>>> -  check consistency, and present the relationships between the 
>>> operations described in chapter 2 'Interface: Execution Port-Type' 
>>> and the different states of the different types of grid Jobs.
>>>
>>>
>>> Joining +9900827049931906 (plus perhaps Skype typing) on Friday 14 
>>> August 2009 at 16h CET.
>>>
>>> Best regards.