[Pgi-wg] OGF PGI - AGU Execution Service Strawman Rendering
Etienne URBAH
urbah at lal.in2p3.fr
Fri Dec 4 09:51:25 CST 2009
On Mon, 26 Oct 2009, Etienne URBAH wrote:
> Aleksandr, Balazs, Morris, Luigi, Johannes and all,
>
>
> Concerning the 'AGU Execution Service Strawman Rendering' of OGF PGI :
>
> Aleksandr KONSTANTINOV and I had a telephone talk on Friday 23
> October at 16h, and we raised the question 'whether we want to tie
> state changes tightly to operations, or whether an operation may
> aggregate multiple state changes'.
>
>
> I am NOT an expert on Web Services. I can imagine 3 ways to implement
> message transfers (between Job Submitter and Execution Service)
> according to the 'Single Job State Model' :
>
>
>
> If Execution Service and Job Submitter both implement notifications
> -------------------------------------------------------------------
> This asynchronous method is the most efficient, but is NOT mandatory.
>
> - On job submission :
> - The Job Submitter sends a 'CreateActivity' request containing 2
> parameters :
> - The vector of Job descriptions,
> - The URL for notification.
> - The Execution Service immediately sends back a 'CreateActivity'
> response containing Jobids (or error messages).
>
> - The Job Submitter waits for notifications.
>
> - Whenever the Job Submitter receives from the Execution Service a
> 'Hold' notification (containing for example the location for manual file
> staging) :
> - He/she performs the appropriate work (for example manual file
> staging),
> - Then he/she sends a 'ChangeActivityStatus' request (for example to
> resume Job processing),
> - The Execution Service immediately sends back a
> 'ChangeActivityStatus' response describing acceptance or refusal.
>
> - As soon as the Job is 'Failed' or 'Finished with success or error',
> the Job Submitter receives from the Execution Service the appropriate
> notification.
>
> - Then, the Job Submitter may send a 'WipeActivity' request to purge
> the Job.
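The notification flow above can be sketched as follows. This is only a minimal illustration under stated assumptions: the class FakeExecutionService, its simulate() driver, the 'resume' action and the callback signature are hypothetical names, and a Python callable stands in for the 'URL for notification'. Only the operation names and the state names come from the discussion.

```python
from typing import Callable, Dict, List

# (job_id, state, info) -> None; stands in for the 'URL for notification'.
Notify = Callable[[str, str, dict], None]


class FakeExecutionService:
    """Hypothetical in-memory Execution Service, for illustration only."""

    def __init__(self) -> None:
        self.states: Dict[str, str] = {}
        self._callbacks: Dict[str, Notify] = {}
        self._next_id = 1

    def create_activity(self, descriptions: List[dict], notify: Notify) -> List[str]:
        # Returns Jobids immediately; no staging information in the response.
        job_ids = []
        for _ in descriptions:
            job_id = f"job-{self._next_id}"
            self._next_id += 1
            self.states[job_id] = "Submitted"
            self._callbacks[job_id] = notify
            job_ids.append(job_id)
        return job_ids

    def change_activity_status(self, job_id: str, action: str) -> bool:
        # Accept a resume only for a Job that is currently on hold.
        if action == "resume" and self.states.get(job_id) == "Pre-processing:Hold":
            self.states[job_id] = "Delegated"
            return True
        return False

    def wipe_activity(self, job_id: str) -> None:
        del self.states[job_id]
        del self._callbacks[job_id]

    def simulate(self, job_id: str) -> None:
        # Drive one Job through a hold and then to completion.
        self._notify(job_id, "Pre-processing:Hold", {"stage_in": "hypothetical-location"})
        if self.states.get(job_id) == "Delegated":
            self._notify(job_id, "Finished with Success or Error", {"return_code": 0})

    def _notify(self, job_id: str, state: str, info: dict) -> None:
        self.states[job_id] = state
        self._callbacks[job_id](job_id, state, info)


def run_submitter(service: FakeExecutionService) -> List[tuple]:
    log: List[tuple] = []

    def on_notification(job_id: str, state: str, info: dict) -> None:
        log.append((job_id, state))
        if state.endswith(":Hold"):
            # Manual file staging would happen here, then the Job is resumed.
            accepted = service.change_activity_status(job_id, "resume")
            log.append((job_id, "resume accepted" if accepted else "resume refused"))
        elif state in ("Failed", "Finished with Success or Error"):
            service.wipe_activity(job_id)

    for job_id in service.create_activity([{"executable": "/bin/true"}], on_notification):
        service.simulate(job_id)
    return log
```

In this sketch the Job Submitter never asks for the status: every state change it cares about arrives through the callback.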
>
>
>
> If Execution Service or Job Submitter does NOT implement notifications
> ----------------------------------------------------------------------
> Then the Job Submitter has to poll the Job status.
>
> - On job submission :
> - The Job Submitter sends a 'CreateActivity' request containing only
> 1 parameter : The vector of Job descriptions,
> - The Execution Service immediately sends back a 'CreateActivity'
> response containing Jobids (or error messages).
>
> - From time to time :
> - The Job Submitter sends a 'GetActivityStatus' request,
> - The Execution Service immediately sends back a 'GetActivityStatus'
> response describing the Job status and appropriate additional
> information (for example the location for manual file staging).
>
> - Whenever necessary (for example the Job status has just become 'Hold') :
> - The Job Submitter performs the appropriate work (for example manual
> file staging),
> - Then he/she sends a 'ChangeActivityStatus' request (for example to
> resume Job processing),
> - The Execution Service immediately sends back a
> 'ChangeActivityStatus' response describing acceptance or refusal.
>
> - When the Job status has become 'Failed' or 'Finished with success or
> error', the Job Submitter may send a 'WipeActivity' request to purge the
> Job.
>
> This method provides consistency with the 'Single Job State Model', but
> requires repetitive 'GetActivityStatus' requests.
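The polling variant can be sketched in the same spirit. Again this is an illustration under assumptions, not a rendering of the specification: PollableService is a hypothetical in-memory stand-in that handles a single Job and advances one state per status query, and the 'resume' action, the polling interval and the poll limit are invented for the example.

```python
import time
from typing import List

TERMINAL_STATES = {"Failed", "Finished with Success or Error"}


class PollableService:
    """Hypothetical in-memory Execution Service without notifications.

    Handles a single Job whose state advances on each status query,
    except while the Job is on hold.
    """

    def __init__(self) -> None:
        self._script = [
            "Submitted",
            "Pre-processing:Hold",
            "Delegated",
            "Post-processing",
            "Finished with Success or Error",
        ]
        self._pos = 0
        self.wiped = False

    def create_activity(self, descriptions: List[dict]) -> List[str]:
        return [f"job-{i + 1}" for i in range(len(descriptions))]

    def get_activity_status(self, job_id: str) -> str:
        state = self._script[self._pos]
        on_hold = state.endswith(":Hold")
        if not on_hold and self._pos < len(self._script) - 1:
            self._pos += 1
        return state

    def change_activity_status(self, job_id: str, action: str) -> bool:
        # Accept a resume only while the Job is on hold.
        if action == "resume" and self._script[self._pos].endswith(":Hold"):
            self._pos += 1
            return True
        return False

    def wipe_activity(self, job_id: str) -> None:
        self.wiped = True


def poll_until_done(service: PollableService, job_id: str,
                    interval: float = 0.0, max_polls: int = 100) -> List[str]:
    """Poll 'from time to time' until the Job reaches a terminal state."""
    seen: List[str] = []
    for _ in range(max_polls):
        state = service.get_activity_status(job_id)
        seen.append(state)
        if state.endswith(":Hold"):
            # Manual file staging would happen here, then the Job is resumed.
            service.change_activity_status(job_id, "resume")
        elif state in TERMINAL_STATES:
            service.wipe_activity(job_id)
            return seen
        time.sleep(interval)
    raise TimeoutError(f"{job_id} did not finish within {max_polls} polls")
```

The cost named in the text is visible here: the Job Submitter issues one 'GetActivityStatus' request per loop iteration, whether or not the state has changed.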
>
>
>
> Method minimizing 'GetActivityStatus' requests without notifications
> --------------------------------------------------------------------
> As far as I have understood from Aleksandr's explanations :
>
> - On job submission, the Job Submitter sends a 'CreateActivity' request
> containing only 1 parameter : The vector of Job descriptions.
>
> - The Execution Service sends back a 'CreateActivity' response
> containing, for each Job :
> - Its Jobid (or error message),
> - If necessary, the location for file stage-in.
>
> - If manual file stage-in is necessary :
> - The Job Submitter :
> - performs the manual file stage-in,
> - sends a 'ChangeActivityStatus' request (for example to resume Job
> processing).
> - The Execution Service sends back a 'ChangeActivityStatus' response
> describing acceptance or refusal.
>
> - From time to time :
> - The Job Submitter sends a 'GetActivityStatus' request,
> - The Execution Service immediately sends back a 'GetActivityStatus'
> response describing the Job status and appropriate additional
> information (for example the location for manual file stage-out).
>
> - Whenever necessary (for example the Job status has just become
> 'Post-processing:Hold:Manual-Stage-Out') :
> - The Job Submitter performs the appropriate work (for example manual
> file stage-out),
> - Then he/she sends a 'ChangeActivityStatus' request (for example to
> resume Job processing),
> - The Execution Service sends back a 'ChangeActivityStatus' response
> describing acceptance or refusal (for example the Job status has become
> 'Failed' or 'Finished with success or error').
>
> - When the Job status has become 'Failed' or 'Finished with success or
> error', the Job Submitter may send a 'WipeActivity' request to purge the
> Job.
>
>
> This method minimizes 'GetActivityStatus' requests, but :
>
> - The time between the 'CreateActivity' request and the
> 'CreateActivity' response (containing the location for file stage-in)
> can be very long (for example if the Job must stay a long time in the
> 'Submitted' state waiting for computing and/or storage resources).
>
> - Repetitive 'GetActivityStatus' requests are still necessary for the
> Job Submitter to learn that a Job has reached the
> 'Post-processing:Hold:Manual-Stage-Out' state (or the 'Finished with
> success or error' state if no manual stage-out is necessary).
>
> So, I can NOT guarantee the consistency of this method with the 'Single
> Job State Model'.
>
>
> Please study the above 3 methods carefully, make up your mind, and send
> comments or remarks, so that we can together improve the design of the
> messages, and achieve consensus.
>
>
> Besides, I will probably NOT be able to attend the PGI telephone
> conferences on 30 October and 06 November 2009.
>
>
> Best regards.
>
> -----------------------------------------------------
> Etienne URBAH LAL, Univ Paris-Sud, IN2P3/CNRS
> Bat 200 91898 ORSAY France
> Tel: +33 1 64 46 84 87 Skype: etienne.urbah
> Mob: +33 6 22 30 53 27 mailto:urbah at lal.in2p3.fr
> -----------------------------------------------------
>
>
> On Fri, 16 Oct 2009, Etienne URBAH wrote:
>> Balazs, Morris, Luigi, Johannes and all,
>>
>> Concerning the 'AGU Execution Service Strawman Rendering' of OGF PGI
>> and the telephone conference of last week on 09 October 2009 :
>>
>> - Many thanks to Morris for having given detailed explanations on
>> chapter 2.1 'CreateActivity Operation'.
>> I now much better understand what is described inside an 'operation'.
>>
>> - Many thanks to Johannes for the Report and for the Action list.
>>
>>
>> Consistency between the CreateActivity operation and the State Model
>> --------------------------------------------------------------------
>> Inside chapter 2.1 'CreateActivity operation', I found discrepancies
>> between the current description of the 'CreateActivity' operation and
>> the PGI Single Job State Model :
>>
>> - Inside the PGI Single Job State Model, the Execution Service :
>> - Allocates a Jobid (or an EPR) to the Job and sends it back to the
>> Submitter at the end of the 'Submitted' state, BEFORE any storage
>> allocation could be performed,
>> - Notifies the submitter with allocated storage resources for
>> stage-in only inside the 'Pre-processing:Hold' state.
>>
>> - The current description of the 'CreateActivity' operation encompasses
>> both the 'Submitted' and 'Pre-processing' states, and describes that
>> the response can contain information about storage resources for
>> stage-in.
>> In fact :
>> - The 'CreateActivity' operation should be limited to the
>> 'Submitted' state, and the response can only be a vector of
>> Jobids (or EPRs). Information about storage resources for stage-in
>> can only be given later, through a 'GetActivityInfo' request or a
>> notification to the submitter.
>> - In order to permit notification, the 'CreateActivity' operation
>> should allow a 'Notification EPR' as an additional optional input
>> parameter.
>>
>> I have updated the document with changes highlighted at
>> http://forge.gridforum.org/sf/go/doc15628?nav=1
>>
>>
>> Hold substate inside the 'Submitted' state ?
>> --------------------------------------------
>> See mail below.
>>
>>
>> Best regards.
>>
>> -----------------------------------------------------
>> Etienne URBAH LAL, Univ Paris-Sud, IN2P3/CNRS
>> Bat 200 91898 ORSAY France
>> Tel: +33 1 64 46 84 87 Skype: etienne.urbah
>> Mob: +33 6 22 30 53 27 mailto:urbah at lal.in2p3.fr
>> -----------------------------------------------------
>>
>>
>> On Thu, 13 Aug 2009, Etienne URBAH wrote:
>>> Balazs, Morris and all,
>>>
>>>
>>> Concerning the last OGF PGI telephone conference on 05 August 2009 :
>>>
>>>
>>> Meeting notes
>>> -------------
>>> I see NO meeting notes about this telephone conference at
>>> http://forge.gridforum.org/sf/discussion/do/listTopics/projects.pgi-wg/discussion.meetings
>>>
>>>
>>> So I am working with my own (fragmentary) notes.
>>>
>>> For all future OGF PGI telephone conferences, is it possible that a
>>> secretary or a chair takes meeting notes, then writes them down in an
>>> understandable form, and publishes them at the above-mentioned page ?
>>>
>>>
>>> Creation of a 'Submitted:Hold' substate ?
>>> -----------------------------------------
>>> First, as general rules, I consider that :
>>>
>>> - In order to AVOID keeping (potentially large) grid resources while
>>> NOT computing, grid Jobs should be designed to be processed
>>> completely automatically, with NO provision for 'Hold' substates,
>>>
>>> - A grid Job needing many 'Hold' substates can NOT be handled by an
>>> automatic Submitter, but should be submitted by a human grid User as
>>> an 'Interactive Job', as described for example at
>>> https://edms.cern.ch/file/722398//gLite-3-UserGuide.html#SECTION00084400000000000000
>>>
>>>
>>>
>>> Someone asked for the creation of a 'Hold' substate inside the
>>> 'Submitted' state, like inside other states.
>>>
>>> This 'Submitted:Hold' substate would make sense only if the Job
>>> Submitter could perform an operation on this substate.
>>>
>>> In order to request such an operation, the Job Submitter needs the
>>> Jobid (or Job EPR).
>>>
>>> This Jobid (or Job EPR) is guaranteed to be allocated by the
>>> Execution Service only at the END of the 'Submitted' state, but NOT
>>> before.
>>>
>>> Therefore, I consider that the 'Submitted' state can NOT contain a
>>> 'Hold' substate.
>>>
>>> If anyone thinks otherwise, can he/she please present a convincing
>>> Use Case ?
>>>
>>>
>>> Clarifications on the 'Finished with Success or Error' state
>>> ------------------------------------------------------------
>>> Someone asked that the 'Error' case of the 'Finished with Success or
>>> Error' state should be moved to the 'Failed' state.
>>>
>>> In fact, inside the current Job State Model, a Job reaches the
>>> 'Finished with Success or Error' state if and only if it successively
>>> reached the end of the following states, without failure or
>>> cancellation at the JOB level :
>>> - 'Pre-processing'
>>> - 'Delegated', whatever the Application result :
>>> - Success = Application return code equal to zero
>>> - Error = Application return code different from zero
>>> - 'Post-processing'
>>>
>>> Inside the 'Finished with Success or Error' state :
>>> - Success means 'Application return code was equal to zero',
>>> - Error means 'Application return code was different from zero'.
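The distinction drawn above can be written down as a small function. This is a sketch only: the function name, its parameters and the formatting of the returned label are assumptions made for the example. The rule itself (failure at the JOB level gives 'Failed', otherwise the Application return code decides between Success and Error) is the one described in the text.

```python
from typing import Optional


def final_state(job_level_failure: bool, return_code: Optional[int]) -> str:
    """Map a Job outcome to its final state (hypothetical helper)."""
    if job_level_failure:
        # Unrecoverable inconsistency detected by the Execution Service.
        return "Failed"
    if return_code is None:
        raise ValueError("a Job that did not fail must have a return code")
    # The Job itself went through all states; only the Application result differs.
    outcome = "Success" if return_code == 0 else "Error"
    return f"Finished with Success or Error ({outcome})"
```

A non-zero return code therefore never produces 'Failed' on its own: it only selects the 'Error' case of the finished state.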
>>>
>>> I copied this behavior from the Job State Model of 'gLite', where the
>>> 'Done' state contains both the 'Success' and 'Exit Code !=0' cases,
>>> as can be seen in the 'bookkeeping information' at
>>> https://edms.cern.ch/file/722398//gLite-3-UserGuide.html#SECTION00084100000000000000
>>>
>>>
>>>
>>> I consider this design, and the strong separation between
>>> the 'Failed' and 'Finished with Success or Error' states, as fully
>>> justified by the following reasons :
>>>
>>> - Whenever a Job reaches the 'Failed' state, the grid Execution
>>> Service detected an unrecoverable inconsistency at the JOB level.
>>> Therefore, the Job output sandbox and the post-processed
>>> Application output files may be inconsistent, and may NOT even be
>>> accessible by the Job Submitter.
>>> In order to investigate the Job failure, the grid User then needs
>>> some grid knowledge (and often experience and expertise) to retrieve
>>> and interpret :
>>> - the Job failure code and message,
>>> - the Job logging and bookkeeping, in comparison with the Job
>>> description.
>>> This 'grid level' investigation can sometimes prove that the cause
>>> of the Job failure came from the Application, but is ALWAYS necessary.
>>>
>>> - Whenever a Job reaches the 'Finished with Success or Error' state,
>>> the grid Execution Service could create the Job output sandbox, and
>>> perform post-processing on Application output files, WITHOUT
>>> detecting any unrecoverable inconsistency at the JOB level.
>>> Therefore, the Job output sandbox, and the post-processed
>>> Application output files, can be supposed to be consistent and easily
>>> accessible by the Job Submitter.
>>> On a non-zero return code of the Application, the grid User :
>>> - first has to look (WITHOUT needing any grid knowledge) at the
>>> Job output sandbox and at the post-processed Application output files
>>> for an Application problem,
>>> - before, if necessary, using grid knowledge (and often experience
>>> and expertise) to provide any evidence that the Application error was
>>> caused by a faulty Job description, the Batch system, or the grid
>>> Execution Service.
>>>
>>> As a summary, I consider that the 'Error' case of the 'Finished with
>>> Success or Error' state should be kept as it is, and NOT be moved to
>>> the 'Failed' state.
>>>
>>> If anyone thinks otherwise, can he/she please present convincing
>>> reasons ?
>>>
>>>
>>> Strawman Rendering
>>> ------------------
>>> I will work on the ODT version of 'Strawman Rendering' at
>>> http://forge.gridforum.org/sf/go/doc15628?nav=1 in order to :
>>>
>>> - include the above clarifications on states,
>>>
>>> - include the 'Types of grid Jobs' section of my 'PGI Execution
>>> Service Overview' document,
>>>
>>> - check consistency, and present the relationships between the
>>> operations described in chapter 2 'Interface: Execution Port-Type'
>>> and the different states of the different types of grid Jobs.
>>>
>>>
>>> Joining +9900827049931906 (plus perhaps Skype typing) on Friday 14
>>> August 2009 at 16h CET.
>>>
>>> Best regards.
>>>
>>> -----------------------------------------------------
>>> Etienne URBAH LAL, Univ Paris-Sud, IN2P3/CNRS
>>> Bat 200 91898 ORSAY France
>>> Tel: +33 1 64 46 84 87 Skype: etienne.urbah
>>> Mob: +33 6 22 30 53 27 mailto:urbah at lal.in2p3.fr
>>> -----------------------------------------------------