[DRMAA-WG] Monitoring JobTemplate attributes for running jobs

Mariusz Mamoński mamonski at man.poznan.pl
Tue Aug 24 12:28:27 CDT 2010


2010/8/24 Daniel Templeton <daniel.templeton at oracle.com>:
> How broadly applicable is it?  OGE supports it, and I think LSF does as
> well.  What about Condor and Torque/PBS?
Torque supports it as well.
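
For illustration, a minimal sketch of how an implementation might map a boolean "rerunnable" JobTemplate attribute onto native submit options. The attribute itself is only a proposal in this thread, and the enum and function names below are hypothetical. The flags shown are the `-r` options of Torque and Grid Engine `qsub` and the `-r` / `-rn` options of LSF `bsub`. Condor is omitted because no direct equivalent is assumed here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical list of DRM back ends; the names are illustrative only.
enum class Drm { Torque, GridEngine, Lsf };

// Returns the native submit options that mark a job as rerunnable or not.
// Assumes a boolean "rerunnable" JobTemplate attribute, which is only a
// proposal in this thread and not part of any DRMAA specification here.
std::vector<std::string> rerunnable_options(Drm drm, bool rerunnable) {
    std::vector<std::string> opts;
    switch (drm) {
        case Drm::Torque:      // qsub -r y|n
        case Drm::GridEngine:  // qsub -r y|n
            opts.push_back("-r");
            opts.push_back(rerunnable ? "y" : "n");
            break;
        case Drm::Lsf:         // bsub -r (rerunnable) or bsub -rn (never rerun)
            opts.push_back(rerunnable ? "-r" : "-rn");
            break;
    }
    return opts;
}
```

A library would append these options to whatever native specification it already builds for the submit call.
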
>
> Daniel
>
> On 08/23/10 07:12 AM, Mariusz Mamoński wrote:
>>
>> Hi,
>>
>>  In some scenarios that rely on fault tolerance at the DRM level, a job
>> must be marked as "rerunnable". Do we want to add this attribute to the
>> DRMAAv2 JobTemplate?
>>
>> Cheers,
>>
>> On 23 August 2010 15:42, Daniel Templeton <daniel.templeton at oracle.com>
>> wrote:
>>>
>>> I have a customer who has the resubmission of failed jobs in a greater
>>> workflow as a critical requirement.  That's not actually something that
>>> OGE itself supports, so I'm all for having it in DRMAA to plug the hole.
>>>
>>> Daniel
>>>
>>> On 08/23/10 02:50 AM, Peter Tröger wrote:
>>>>
>>>> We already have some understanding of persistency, so the implementation
>>>> effort is manageable. I am more concerned about a clear separation of live
>>>> monitoring information and original submission data. For the latter, I saw
>>>> no use case so far ...
>>>>
>>>> Best,
>>>> Peter.
>>>>
>>>> On 29.07.2010 at 11:02, Andre Merzky wrote:
>>>>
>>>>> Our use case for having access to the original complete job template
>>>>> is that the user can easily resubmit the same job - just changing
>>>>> for example some command line parameter, but leaving the remainder
>>>>> fixed.   In SAGA this would look like:
>>>>>
>>>>>   saga::job::service     js ("drmaa://torque.remote.net/");
>>>>>   saga::job::job         j1 = js.get_job (jobid);   // std::string
>>>>>   saga::job::description jd = j1.get_description ();
>>>>>
>>>>>   jd.set_attributes ("Arguments", new_args);  // std::vector<std::string>
>>>>>
>>>>>   saga::job::job j2 = js.create_job (jd);
>>>>>
>>>>>
>>>>> I understand that the backend may not be able to keep the original
>>>>> job template - in that case, a 'DoesNotExist' exception on
>>>>> 'get_description()' would be appropriate, IMHO.  If the DRMAA
>>>>> implementation can cache that description somewhere, fine :-)
>>>>>
>>>>> My $0.02, Andre.
>>>>>
>>>>>
>>>>> PS: saga::job::description == drmaa::job::template
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Quoting [Peter Tröger] (Jul 29 2010):
>>>>>>
>>>>>> From: Peter Tröger <peter at troeger.eu>
>>>>>> Date: Thu, 29 Jul 2010 10:07:23 +0200
>>>>>> To: Mariusz Mamoński <mamonski at man.poznan.pl>,
>>>>>>     drmaa-wg at ogf.org
>>>>>> Subject: Re: [DRMAA-WG] Monitoring JobTemplate attributes for running
>>>>>> jobs
>>>>>>
>>>>>>
>>>>>> On 28.07.2010 at 23:42, Mariusz Mamoński wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> 2010/7/28 Peter Tröger<peter at troeger.eu>:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>> Agenda item #8 was not discussed in the call today, but it is the
>>>>>>>> burning issue for me at the moment. Please have a look at the
>>>>>>>> "Attributes in JobInfo" tab:
>>>>>>>>
>>>>>>>> http://spreadsheets.google.com/ccc?key=0AqyvnBscJNqxcnJBSUs5dXRrU29EUVhGOGthc1lDTFE
>>>>>>>>
>>>>>>>> Currently, we allow access to the original JobTemplate from a JobInfo
>>>>>>>> object. The idea was to get, besides the job monitoring information,
>>>>>>>> also the information about what was requested at submission time.
>>>>>>>>
>>>>>>>> While doing the Condor mapping, I figured out that most of the
>>>>>>>> JobTemplate attributes can also be monitored for a running job. This
>>>>>>>> includes things such as executable name and working directory.
>>>>>>>> Normally they should be the same as in the JobTemplate, but Condor
>>>>>>>> and SGE (at least) have this magic job wrapper stuff, where the admin
>>>>>>>> can automatically and silently reconfigure / reinterpret everything
>>>>>>>> in a JobTemplate. This might lead to the situation where the user
>>>>>>>> asks for A and silently gets B.
>>>>>>>>
>>>>>>>> The question: Should we drop the support for getting the JobTemplate
>>>>>>>> as part of JobInfo, because the information is useless? Instead, we
>>>>>>>> could add some (or maybe most) of the JobTemplate attributes as true
>>>>>>>> dynamic monitoring information to JobInfo.
>>>>>>>
>>>>>>> In my opinion, repeating almost all attributes in this case brings
>>>>>>> additional redundancy into the DRMAA API (another reason may be
>>>>>>> performance - the JobTemplate attributes are more likely immutable).
>>>>>>> Why not simply require the expected behavior in the spec? E.g. one of:
>>>>>>> a) the JobTemplate being part of the JobInfo struct is a reference to
>>>>>>> the JobTemplate used for submission (for jobs submitted outside the
>>>>>>> session it MUST be NULL)
>>>>>>> b) the JobTemplate reflects the actual attributes of a job (without
>>>>>>> obligation that all attributes must be available - e.g. in Torque the
>>>>>>> actually executed command is hidden in the script)
>>>>>>
>>>>>> The interesting thing is that we already started to do this
>>>>>> replication, for example: JobTemplate::candidateMachines vs.
>>>>>> JobInfo::allocatedMachines. I still vote for finishing this replication
>>>>>> and removing the JT reference from JobInfo as compensation. I also have
>>>>>> a problem with fetching live data from a structure called "template".
>>>>>>
>>>>>> Your example from Torque underlines my argument - we should choose a
>>>>>> monitorable subset of JobTemplate and add it to the JobInfo structure,
>>>>>> instead of linking the JobTemplate directly.
>>>>>>
>>>>>> Any other opinions?
>>>>>>
>>>>>> Peter.
>>>>>
>>>>> --
>>>>> Nothing is ever easy.
>>>>
>>>> --
>>>>    drmaa-wg mailing list
>>>>    drmaa-wg at ogf.org
>>>>    http://www.ogf.org/mailman/listinfo/drmaa-wg
>>>
>>
>>
>>
>
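
A minimal sketch of the design question in the quoted thread above, assuming simplified, hypothetical `JobTemplate` and `JobInfo` structs (these are not the DRMAAv2 definitions; every field and function name is invented for illustration). `JobInfo` carries a monitorable subset of attributes as live values and, following option (a), an optional reference to the submission-time template that is null for jobs submitted outside the session.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified JobTemplate: what the user asked for at submission.
struct JobTemplate {
    std::string remote_command;
    std::vector<std::string> args;
    std::string working_directory;
};

// Hypothetical JobInfo: what the DRM system reports for the job.
struct JobInfo {
    std::string job_id;
    // Monitored values; a job wrapper may have changed them after submission.
    std::string remote_command;
    std::string working_directory;
    // Option (a): the template used for submission, or null when the job
    // was submitted outside the session or the template was not kept.
    std::shared_ptr<const JobTemplate> submitted_template;
};

// Resubmission use case: copy the original template and replace only the
// arguments. Returns null when no original template is available.
std::unique_ptr<JobTemplate> template_for_resubmit(
        const JobInfo& info, const std::vector<std::string>& new_args) {
    if (!info.submitted_template) return nullptr;
    std::unique_ptr<JobTemplate> jt(new JobTemplate(*info.submitted_template));
    jt->args = new_args;
    return jt;
}

// True when the monitored values differ from the request, i.e. the user
// asked for A and got B. Without a template there is nothing to compare.
bool was_rewritten(const JobInfo& info) {
    if (!info.submitted_template) return false;
    return info.remote_command != info.submitted_template->remote_command ||
           info.working_directory != info.submitted_template->working_directory;
}
```

Keeping both the live values and the optional template lets a caller detect a silent rewrite and still resubmit the original request. Dropping the template reference, as also proposed in the thread, would remove `submitted_template` and with it both helper functions.
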



-- 
Mariusz

