[Pgi-wg] OGF PGI - AGU Execution Service Strawman Rendering
Etienne URBAH
urbah at lal.in2p3.fr
Mon Oct 26 13:48:15 CDT 2009
Aleksandr, Balazs, Morris, Luigi, Johannes and all,
Concerning the 'AGU Execution Service Strawman Rendering' of OGF PGI :
Aleksandr KONSTANTINOV and I had a telephone conversation on Friday 23
October at 16h, and we raised the question of whether we want to tie
state changes tightly to operations, or whether one operation may
aggregate multiple state changes.
I am NOT an expert on Web Services. I can imagine 3 ways to implement
message transfers (between Job Submitter and Execution Service)
according to the 'Single Job State Model' :
If Execution Service and Job Submitter both implement notifications
-------------------------------------------------------------------
This asynchronous method is the most efficient, but is NOT mandatory.
- On job submission :
  - The Job Submitter sends a 'CreateActivity' request containing 2
    parameters :
    - The vector of Job descriptions,
    - The URL for notification.
  - The Execution Service immediately sends back a 'CreateActivity'
    response containing Jobids (or error messages).
- The Job Submitter waits for notifications.
- Whenever the Job Submitter receives from the Execution Service a
  'Hold' notification (containing for example the location for manual file
  staging) :
  - He/she performs the appropriate work (for example manual file
    staging),
  - Then he/she sends a 'ChangeActivityStatus' request (for example to
    resume Job processing),
  - The Execution Service immediately sends back a
    'ChangeActivityStatus' response describing acceptance or refusal.
- As soon as the Job is 'Failed' or 'Finished with success or error',
  the Job Submitter receives from the Execution Service the appropriate
  notification.
- Then, the Job Submitter may send a 'WipeActivity' request to purge
  the Job.
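As a rough illustration only, the exchange above can be sketched in Python with a toy in-memory service. This is NOT a proposal for the rendering : the classes 'ExecutionService' and 'JobSubmitter', the callback standing in for the notification URL, the 'resume' request value, the intermediate 'Resumed' state, the '/staging/...' locations and the rule that every Job needs exactly one 'Hold' are all assumptions made for the example.

```python
import itertools
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

FINISHED = "Finished with success or error"
FINAL_STATES = {"Failed", FINISHED}

# Hypothetical stand-in for the 'URL for notification': a callback
# receiving (jobid, state, additional information).
Notify = Callable[[str, str, dict], None]


class ExecutionService:
    """Toy in-memory Execution Service (illustration only)."""

    def __init__(self) -> None:
        self._ids = itertools.count()
        self._states: Dict[str, str] = {}
        self._callbacks: Dict[str, Notify] = {}
        self._pending: Deque[Tuple[str, str, dict]] = deque()

    def create_activity(self, descriptions: List[dict], notify: Notify) -> List[str]:
        """Answer immediately with Jobids; states are reported later."""
        jobids = []
        for _description in descriptions:
            jobid = f"job-{next(self._ids)}"
            self._states[jobid] = "Submitted"
            self._callbacks[jobid] = notify
            # Toy assumption: every Job needs manual file staging.
            self._pending.append((jobid, "Hold", {"location": f"/staging/{jobid}"}))
            jobids.append(jobid)
        return jobids

    def change_activity_status(self, jobid: str, request: str) -> bool:
        """Answer immediately with acceptance (True) or refusal (False)."""
        if request != "resume" or self._states.get(jobid) != "Hold":
            return False
        # Toy assumptions: 'Resumed' is an invented intermediate state, and
        # a resumed Job finishes without any further 'Hold'.
        self._states[jobid] = "Resumed"
        self._pending.append((jobid, FINISHED, {}))
        return True

    def wipe_activity(self, jobid: str) -> None:
        self._states.pop(jobid, None)
        self._callbacks.pop(jobid, None)

    def deliver_notifications(self) -> None:
        """Simulate the asynchronous delivery of queued notifications."""
        while self._pending:
            jobid, state, info = self._pending.popleft()
            self._states[jobid] = state
            self._callbacks[jobid](jobid, state, info)


class JobSubmitter:
    """Toy Job Submitter reacting to notifications."""

    def __init__(self, service: ExecutionService) -> None:
        self.service = service
        self.log: List[Tuple[str, str]] = []

    def submit(self, descriptions: List[dict]) -> List[str]:
        return self.service.create_activity(descriptions, self.on_notification)

    def on_notification(self, jobid: str, state: str, info: dict) -> None:
        self.log.append((jobid, state))
        if state == "Hold":
            # Manual file staging to info["location"] would happen here.
            accepted = self.service.change_activity_status(jobid, "resume")
            self.log.append((jobid, "accepted" if accepted else "refused"))
        elif state in FINAL_STATES:
            self.service.wipe_activity(jobid)
```

In this sketch the Job Submitter never asks for the Job status : after 'submit', everything it does is triggered by 'deliver_notifications' on the service side.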
If Execution Service or Job Submitter does NOT implement notifications
----------------------------------------------------------------------
Then the Job Submitter has to poll the Job status.
- On job submission :
  - The Job Submitter sends a 'CreateActivity' request containing only
    1 parameter : The vector of Job descriptions,
  - The Execution Service immediately sends back a 'CreateActivity'
    response containing Jobids (or error messages).
- From time to time :
  - The Job Submitter sends a 'GetActivityStatus' request,
  - The Execution Service immediately sends back a 'GetActivityStatus'
    response describing the Job status and appropriate additional
    information (for example the location for manual file staging).
- Whenever necessary (for example the Job status has just become 'Hold') :
  - The Job Submitter performs the appropriate work (for example
    manual file staging),
  - Then he/she sends a 'ChangeActivityStatus' request (for example to
    resume Job processing),
  - The Execution Service immediately sends back a
    'ChangeActivityStatus' response describing acceptance or refusal.
- When the Job status has become 'Failed' or 'Finished with success or
  error', the Job Submitter may send a 'WipeActivity' request to purge the
  Job.
This method provides consistency with the 'Single Job State Model', but
requires repeated 'GetActivityStatus' requests.
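As a rough illustration only, the polling exchange above can be sketched in Python. The class 'PollingExecutionService', the function 'run_by_polling', the 'resume' request value, the '/staging/...' location and the fixed three-step sequence of states are assumptions made for the example, NOT part of the rendering.

```python
import itertools
import time
from typing import Dict, List, Tuple

FINISHED = "Finished with success or error"
FINAL_STATES = {"Failed", FINISHED}


class PollingExecutionService:
    """Toy Execution Service without notifications (illustration only)."""

    # Assumed sequence of states walked through by every Job.
    SCRIPT = ["Submitted", "Hold", FINISHED]

    def __init__(self) -> None:
        self._ids = itertools.count()
        self._position: Dict[str, int] = {}

    def create_activity(self, descriptions: List[dict]) -> List[str]:
        """Answer immediately with one Jobid per Job description."""
        jobids = []
        for _description in descriptions:
            jobid = f"job-{next(self._ids)}"
            self._position[jobid] = 0
            jobids.append(jobid)
        return jobids

    def get_activity_status(self, jobid: str) -> Tuple[str, dict]:
        """Answer immediately with the Job status and additional information."""
        state = self.SCRIPT[self._position[jobid]]
        info = {"location": f"/staging/{jobid}"} if state == "Hold" else {}
        if state == "Submitted":
            # Toy assumption: the Job leaves 'Submitted' between two polls.
            self._position[jobid] += 1
        return state, info

    def change_activity_status(self, jobid: str, request: str) -> bool:
        """Answer immediately with acceptance (True) or refusal (False)."""
        if request == "resume" and self.SCRIPT[self._position[jobid]] == "Hold":
            self._position[jobid] += 1
            return True
        return False

    def wipe_activity(self, jobid: str) -> None:
        del self._position[jobid]


def run_by_polling(service: PollingExecutionService, description: dict,
                   interval: float = 0.0) -> Tuple[str, int]:
    """Submit one Job, poll until a final state, return (state, number of polls)."""
    (jobid,) = service.create_activity([description])
    polls = 0
    while True:
        state, info = service.get_activity_status(jobid)
        polls += 1
        if state == "Hold":
            # Manual file staging to info["location"] would happen here.
            service.change_activity_status(jobid, "resume")
        elif state in FINAL_STATES:
            service.wipe_activity(jobid)
            return state, polls
        time.sleep(interval)
```

Every state change is learnt through a separate 'GetActivityStatus' request, which is the cost of this method : in a real deployment the number of polls grows with the Job duration divided by the polling interval.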
Method minimizing 'GetActivityStatus' requests without notifications
--------------------------------------------------------------------
As far as I have understood from Aleksandr's explanations :
- On job submission, the Job Submitter sends a 'CreateActivity' request
  containing only 1 parameter : The vector of Job descriptions.
- The Execution Service sends back a 'CreateActivity' response
  containing, for each Job :
  - Its Jobid (or error message),
  - If necessary, the location for file stage-in.
- If manual file stage-in is necessary :
  - The Job Submitter :
    - performs the manual file stage-in,
    - sends a 'ChangeActivityStatus' request (for example to resume
      Job processing).
  - The Execution Service sends back a 'ChangeActivityStatus' response
    describing acceptance or refusal.
- From time to time :
  - The Job Submitter sends a 'GetActivityStatus' request,
  - The Execution Service immediately sends back a 'GetActivityStatus'
    response describing the Job status and appropriate additional
    information (for example the location for manual file stage-out).
- Whenever necessary (for example the Job status has just become
  'Post-processing:Hold:Manual-Stage-Out') :
  - The Job Submitter performs the appropriate work (for example
    manual file stage-out),
  - Then he/she sends a 'ChangeActivityStatus' request (for example to
    resume Job processing),
  - The Execution Service sends back a 'ChangeActivityStatus' response
    describing acceptance or refusal (for example the Job status has become
    'Failed' or 'Finished with success or error').
- When the Job status has become 'Failed' or 'Finished with success or
  error', the Job Submitter may send a 'WipeActivity' request to purge the
  Job.
This method minimizes 'GetActivityStatus' requests, but :
- The time between the 'CreateActivity' request and the
  'CreateActivity' response (containing the location for file stage-in)
  can be very long (for example if the Job must stay a long time in the
  'Submitted' state waiting for computing and/or storage resources).
- Repeated 'GetActivityStatus' requests are still necessary for the
  Job Submitter to learn that a Job has reached the
  'Post-processing:Hold:Manual-Stage-Out' state (or the 'Finished with
  success or error' state if no manual stage-out is necessary).
So, I can NOT guarantee the consistency of this method with the 'Single
Job State Model'.
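As a rough illustration only, and as far as I have understood this third method, it can be sketched in Python as follows. The class 'AggregatingExecutionService', the function 'run_aggregated', the 'resume' request value, the '/staging/...' locations, the fixed sequence of states and the fact that the 'ChangeActivityStatus' response also carries the new Job status are assumptions made for the example, NOT part of the rendering.

```python
import itertools
from typing import Dict, List, Tuple

STAGE_IN_HOLD = "Pre-processing:Hold"
STAGE_OUT_HOLD = "Post-processing:Hold:Manual-Stage-Out"
FINISHED = "Finished with success or error"
FINAL_STATES = {"Failed", FINISHED}

# Assumed sequence of states walked through by every Job.
SCRIPT = [STAGE_IN_HOLD, "Delegated", STAGE_OUT_HOLD, FINISHED]


class AggregatingExecutionService:
    """Toy service whose operations aggregate several state changes."""

    def __init__(self) -> None:
        self._ids = itertools.count()
        self._position: Dict[str, int] = {}

    def create_activity(self, descriptions: List[dict]) -> List[dict]:
        """Answer only once the stage-in location of each Job is known."""
        response = []
        for _description in descriptions:
            jobid = f"job-{next(self._ids)}"
            # The Job has already gone through 'Submitted' when we answer.
            self._position[jobid] = 0
            response.append({"jobid": jobid, "stage_in": f"/staging/{jobid}/in"})
        return response

    def get_activity_status(self, jobid: str) -> Tuple[str, dict]:
        state = SCRIPT[self._position[jobid]]
        info = {"stage_out": f"/staging/{jobid}/out"} if state == STAGE_OUT_HOLD else {}
        if state == "Delegated":
            # Toy assumption: the Application ends between two polls.
            self._position[jobid] += 1
        return state, info

    def change_activity_status(self, jobid: str, request: str) -> Tuple[bool, str]:
        """Answer with (acceptance, Job status after the request)."""
        state = SCRIPT[self._position[jobid]]
        if request != "resume" or "Hold" not in state:
            return False, state
        self._position[jobid] += 1
        return True, SCRIPT[self._position[jobid]]

    def wipe_activity(self, jobid: str) -> None:
        del self._position[jobid]


def run_aggregated(service: AggregatingExecutionService,
                   description: dict) -> Tuple[str, int]:
    """Submit one Job and return (final state, number of polls)."""
    (entry,) = service.create_activity([description])
    jobid = entry["jobid"]
    # Manual file stage-in to entry["stage_in"] would happen here.
    _accepted, state = service.change_activity_status(jobid, "resume")
    polls = 0
    while state not in FINAL_STATES:
        state, info = service.get_activity_status(jobid)
        polls += 1
        if state == STAGE_OUT_HOLD:
            # Manual file stage-out from info["stage_out"] would happen here.
            _accepted, state = service.change_activity_status(jobid, "resume")
    service.wipe_activity(jobid)
    return state, polls
```

The sketch shows both sides of the trade-off : no poll is needed before stage-in, because 'create_activity' already covers the 'Submitted' state and the beginning of 'Pre-processing', but polls are still needed to detect the stage-out hold.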
Please study the above 3 methods carefully, make up your mind, and send
comments or remarks, so that we can together improve the design of the
messages, and achieve consensus.
Besides, I will probably NOT be able to attend the PGI telephone
conferences on 30 October and 06 November 2009.
Best regards.
-----------------------------------------------------
Etienne URBAH LAL, Univ Paris-Sud, IN2P3/CNRS
Bat 200 91898 ORSAY France
Tel: +33 1 64 46 84 87 Skype: etienne.urbah
Mob: +33 6 22 30 53 27 mailto:urbah at lal.in2p3.fr
-----------------------------------------------------
On Fri, 16 Oct 2009, Etienne URBAH wrote:
> Balazs, Morris, Luigi, Johannes and all,
>
> Concerning the 'AGU Execution Service Strawman Rendering' of OGF PGI and
> the telephone conference of last week on 09 October 2009 :
>
> - Many thanks to Morris for having given detailed explanations on
> chapter 2.1 'CreateActivity Operation'.
> I now understand much better what is described inside an 'operation'.
>
> - Many thanks to Johannes for the Report and for the Action list.
>
>
> Consistency between the CreateActivity operation and the State Model
> --------------------------------------------------------------------
> Inside chapter 2.1 'CreateActivity operation', I found discrepancies
> between the current description of the 'CreateActivity' operation and
> the PGI Single Job State Model :
>
> - Inside the PGI Single Job State Model, the Execution Service :
>   - Allocates a Jobid (or an EPR) to the Job and sends it back to the
>     Submitter at the end of the 'Submitted' state, BEFORE any storage
>     allocation could be performed,
>   - Notifies the Submitter of the allocated storage resources for
>     stage-in only inside the 'Pre-processing:Hold' state.
>
> - The current description of the 'CreateActivity' operation encompasses
>   both the 'Submitted' and 'Pre-processing' states, and states that the
>   response can contain information about storage resources for stage-in.
>   In fact :
>   - The 'CreateActivity' operation should be limited to the 'Submitted'
>     state, and the response can only be a vector of Jobids (or EPRs).
>     Information about storage resources for stage-in can only be given
>     later, through a 'GetActivityInfo' request or a notification to the
>     Submitter.
>   - In order to permit notification, the 'CreateActivity' operation
>     should allow a 'Notification EPR' as an additional optional input
>     parameter.
>
> I have updated the document with changes highlighted at
> http://forge.gridforum.org/sf/go/doc15628?nav=1
>
>
> Hold substate inside the 'Submitted' state ?
> --------------------------------------------
> See mail below.
>
>
> Best regards.
>
>
>
> On Thu, 13 Aug 2009, Etienne URBAH wrote:
>> Balazs, Morris and all,
>>
>>
>> Concerning the last OGF PGI telephone conference on 05 August 2009 :
>>
>>
>> Meeting notes
>> -------------
>> I see NO meeting notes about this telephone conference at
>> http://forge.gridforum.org/sf/discussion/do/listTopics/projects.pgi-wg/discussion.meetings
>>
>>
>> So I am working with my own (fragmentary) notes.
>>
>> For all future OGF PGI telephone conferences, is it possible that a
>> secretary or a chair takes meeting notes, then writes them down in an
>> understandable form, and publishes them at the above mentioned page ?
>>
>>
>> Creation of a 'Submitted:Hold' substate ?
>> -----------------------------------------
>> First, as general rules, I consider that :
>>
>> - In order to AVOID keeping (potentially large) grid resources while
>> NOT computing, grid Jobs should be designed to be processed completely
>> automatically, with NO provision for 'Hold' substates,
>>
>> - A grid Job needing many 'Hold' substates can NOT be handled by an
>> automatic Submitter, but should be submitted by a human grid User as
>> an 'Interactive Job', as described for example at
>> https://edms.cern.ch/file/722398//gLite-3-UserGuide.html#SECTION00084400000000000000
>>
>>
>>
>> Someone asked for the creation of a 'Hold' substate inside the
>> 'Submitted' state, like inside other states.
>>
>> This 'Submitted:Hold' substate would make sense only if the Job
>> Submitter could perform an operation on this substate.
>>
>> In order to request such an operation, the Job Submitter needs the
>> Jobid (or Job EPR).
>>
>> This Jobid (or Job EPR) is guaranteed to be allocated by the Execution
>> Service only at the END of the 'Submitted' state, but NOT before.
>>
>> Therefore, I consider that the 'Submitted' state can NOT contain a
>> 'Hold' substate.
>>
>> If anyone thinks otherwise, can he/she please present a convincing Use
>> Case ?
>>
>>
>> Clarifications about the 'Finished with Success or Error' state
>> ---------------------------------------------------------------
>> Someone asked that the 'Error' case of the 'Finished with Success or
>> Error' state should be moved to the 'Failed' state.
>>
>> In fact, inside the current Job State Model, a Job reaches the
>> 'Finished with Success or Error' state if and only if it successively
>> reached the end of the following states, without failure or cancellation
>> at the JOB level :
>> - 'Pre-processing'
>> - 'Delegated', whatever the Application result :
>>   - Success = Application return code equal to zero
>>   - Error = Application return code different from zero
>> - 'Post-processing'
>>
>> Inside the 'Finished with Success or Error' state :
>> - Success means 'Application return code was equal to zero',
>> - Error means 'Application return code was different from zero'.
>>
>> I copied this behavior from the Job State Model of 'gLite', where the
>> 'Done' state contains both the 'Success' and 'Exit Code !=0' cases, as
>> can be seen in the 'bookkeeping information' at
>> https://edms.cern.ch/file/722398//gLite-3-UserGuide.html#SECTION00084100000000000000
>>
>>
>>
>> I consider this design, and the strong separation between the
>> 'Failed' and 'Finished with Success or Error' states, as fully
>> justified by the following reasons :
>>
>> - Whenever a Job reaches the 'Failed' state, the grid Execution
>>   Service has detected an unrecoverable inconsistency at the JOB level.
>>   Therefore, the Job output sandbox and the post-processed
>>   Application output files may be inconsistent, and may NOT even be
>>   accessible by the Job Submitter.
>>   In order to investigate the Job failure, the grid User then needs
>>   some grid knowledge (and often experience and expertise) to retrieve
>>   and interpret :
>>   - the Job failure code and message,
>>   - the Job logging and bookkeeping, in comparison with the Job
>>     description.
>>   This 'grid level' investigation can sometimes prove that the cause
>>   of the Job failure came from the Application, but is ALWAYS necessary.
>>
>> - Whenever a Job reaches the 'Finished with Success or Error' state,
>>   the grid Execution Service was able to create the Job output sandbox,
>>   and to perform post-processing on Application output files, WITHOUT
>>   detecting any unrecoverable inconsistency at the JOB level.
>>   Therefore, the Job output sandbox, and the post-processed
>>   Application output files, can be assumed to be consistent and easily
>>   accessible by the Job Submitter.
>>   On a non-zero return code of the Application, the grid User :
>>   - first has to look (WITHOUT needing any grid knowledge) at the Job
>>     output sandbox and at the post-processed Application output files for
>>     an Application problem,
>>   - before, if necessary, using grid knowledge (and often experience
>>     and expertise) to provide any evidence that the Application error was
>>     caused by a faulty Job description, the Batch system, or the grid
>>     Execution Service.
>>
>> As a summary, I consider that the 'Error' case of the 'Finished with
>> Success or Error' state should be kept as it is, and NOT be moved to
>> the 'Failed' state.
>>
>> If anyone thinks otherwise, can he/she please present convincing
>> reasons ?
>>
>>
>> Strawman Rendering
>> ------------------
>> I will work on the ODT version of 'Strawman Rendering' at
>> http://forge.gridforum.org/sf/go/doc15628?nav=1 in order to :
>>
>> - include the above clarifications on states,
>>
>> - include the 'Types of grid Jobs' section of my 'PGI Execution
>> Service Overview' document,
>>
>> - check consistency, and present the relationships between the
>> operations described in chapter 2 'Interface: Execution Port-Type' and
>> the different states of the different types of grid Jobs.
>>
>>
>> Joining +9900827049931906 (plus perhaps Skype typing) on Friday 14
>> August 2009 at 16h CET.
>>
>> Best regards.
>>