[DRMAA-WG] Monitoring JobTemplate attributes for running jobs

Peter Tröger peter at troeger.eu
Wed Aug 25 03:14:42 CDT 2010


Condor has checkpointable jobs, which are the corresponding equivalent. So the answer is yes.

/Peter.

On 24.08.2010 at 19:28, Mariusz Mamoński wrote:

> 2010/8/24 Daniel Templeton <daniel.templeton at oracle.com>:
>> How broadly applicable is it?  OGE supports it, and I think LSF does as
>> well.  What about Condor and Torque/PBS?
> Torque supports it as well.
>> 
>> Daniel
>> 
>> On 08/23/10 07:12 AM, Mariusz Mamoński wrote:
>>> 
>>> Hi,
>>> 
>>>  In some scenarios with fault tolerance at the DRM level, a job must be
>>> marked as "rerunnable". Do we want to add this attribute to the DRMAAv2
>>> JobTemplate?
>>> 
>>> Cheers,
>>> 
>>> On 23 August 2010 15:42, Daniel Templeton <daniel.templeton at oracle.com> wrote:
>>>> 
>>>> I have a customer who has the resubmission of failed jobs in a greater
>>>> workflow as a critical requirement.  That's not actually something that
>>>> OGE itself supports, so I'm all for having it in DRMAA to plug the hole.
>>>> 
>>>> Daniel
>>>> 
>>>> On 08/23/10 02:50 AM, Peter Tröger wrote:
>>>>> 
>>>>> We already have some understanding of persistency, so the implementation
>>>>> effort is manageable. I am more concerned about a clear separation of live
>>>>> monitoring information and original submission data. For the latter, I saw
>>>>> no use case so far ...
>>>>> 
>>>>> Best,
>>>>> Peter.
>>>>> 
>>>>> On 29.07.2010 at 11:02, Andre Merzky wrote:
>>>>> 
>>>>>> Our use case for having access to the original complete job template
>>>>>> is that the user can easily resubmit the same job - just changing
>>>>>> for example some command line parameter, but leaving the remainder
>>>>>> fixed.   In SAGA this would look like:
>>>>>> 
>>>>>>   saga::job::service     js ("drmaa://torque.remote.net/");
>>>>>>   saga::job::job         j1 = js.get_job (jobid);   // std::string
>>>>>>   saga::job::description jd = j1.get_description ();
>>>>>> 
>>>>>>   jd.set_attributes ("Arguments", new_args);  // std::vector<std::string>
>>>>>> 
>>>>>>   saga::job::job j2 = js.create_job (jd);
>>>>>> 
>>>>>> 
>>>>>> I understand that the backend may not be able to keep the original
>>>>>> job template - in that case, a 'DoesNotExist' exception on
>>>>>> 'get_description()' would be appropriate, IMHO.  If the DRMAA
>>>>>> implementation can cache that description somewhere, fine :-)
>>>>>> 
>>>>>> My $0.02, Andre.
>>>>>> 
>>>>>> 
>>>>>> PS: saga::job::description == drmaa::job::template
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Quoting [Peter Tröger] (Jul 29 2010):
>>>>>>> 
>>>>>>> From: Peter Tröger <peter at troeger.eu>
>>>>>>> Date: Thu, 29 Jul 2010 10:07:23 +0200
>>>>>>> To: Mariusz Mamoński <mamonski at man.poznan.pl>,
>>>>>>>     drmaa-wg at ogf.org
>>>>>>> Subject: Re: [DRMAA-WG] Monitoring JobTemplate attributes for running jobs
>>>>>>> 
>>>>>>> 
>>>>>>> On 28.07.2010 at 23:42, Mariusz Mamoński wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> 2010/7/28 Peter Tröger<peter at troeger.eu>:
>>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> Agenda item #8 was not discussed in the call today, but it is the
>>>>>>>>> burning issue for me at the moment. Please have a look at the
>>>>>>>>> "Attributes in JobInfo" tab:
>>>>>>>>> 
>>>>>>>>> http://spreadsheets.google.com/ccc?key=0AqyvnBscJNqxcnJBSUs5dXRrU29EUVhGOGthc1lDTFE
>>>>>>>>> 
>>>>>>>>> Currently, we allow access to the original JobTemplate from a JobInfo
>>>>>>>>> object. The idea was to get, besides the job monitoring information,
>>>>>>>>> also the information about what was demanded at submission time.
>>>>>>>>> 
>>>>>>>>> While doing the Condor mapping, I figured out that most of the
>>>>>>>>> JobTemplate attributes are also monitorable for a running job. This
>>>>>>>>> includes things such as executable name and working directory.
>>>>>>>>> Normally they should be the same as in the JobTemplate, but Condor
>>>>>>>>> and SGE (at least) have this magic job wrapper stuff, where the admin
>>>>>>>>> can automatically and silently reconfigure / reinterpret everything
>>>>>>>>> in a JobTemplate. This might lead to the situation where the user
>>>>>>>>> asks for A, and silently gets B.
>>>>>>>>> 
>>>>>>>>> The question: Should we drop the support for getting the JobTemplate
>>>>>>>>> as part of JobInfo, because the information is useless? Instead, we
>>>>>>>>> could add some (or maybe most) of the JobTemplate attributes as true
>>>>>>>>> dynamic monitoring information to JobInfo.
>>>>>>>> 
>>>>>>>> In my opinion, repeating almost all attributes in this case brings
>>>>>>>> additional redundancy into the DRMAA API (another reason may be
>>>>>>>> performance - the JobTemplate attributes are more likely immutable).
>>>>>>>> Why not simply require the expected behavior in the spec? E.g.:
>>>>>>>> a) the JobTemplate being part of the JobInfo struct is a reference to
>>>>>>>> the JobTemplate used for submission (for jobs submitted outside the
>>>>>>>> session it MUST be NULL)
>>>>>>>> b) the JobTemplate reflects the actual attributes of a job (without
>>>>>>>> obligation that all attributes must be available - e.g. in Torque the
>>>>>>>> actually executed command is hidden in a script)
>>>>>>> 
>>>>>>> The interesting thing is that we already started to do this
>>>>>>> replication, for example: JobTemplate::candidateMachines vs.
>>>>>>> JobInfo::allocatedMachines. I still vote for finishing this replication,
>>>>>>> and removing the JT reference from JobInfo as compensation. I also have a
>>>>>>> problem with fetching live data from a structure called "template".
>>>>>>> 
>>>>>>> Your example from Torque underlines my argument - we should choose
>>>>>>> a monitorable subset of JobTemplate and add it to the JobInfo structure,
>>>>>>> instead of linking the JobTemplate directly.
>>>>>>> 
>>>>>>> Any other opinions?
>>>>>>> 
>>>>>>> Peter.
>>>>>> 
>>>>>> --
>>>>>> Nothing is ever easy.
>>>>> 
>>>>> --
>>>>>    drmaa-wg mailing list
>>>>>    drmaa-wg at ogf.org
>>>>>    http://www.ogf.org/mailman/listinfo/drmaa-wg
>>>> 
> -- 
> Mariusz
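
The resubmission use case quoted above (fetch the stored job description,
change only the arguments, submit again) can be sketched in plain C++.
This is a minimal sketch under stated assumptions, not DRMAA or SAGA code:
JobTemplate, JobService, DoesNotExist and resubmit() are hypothetical
stand-ins invented for illustration. The keepTemplates flag models a
backend that may or may not be able to keep the original template.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a DRMAA JobTemplate / SAGA job description.
struct JobTemplate {
    std::string executable;
    std::vector<std::string> arguments;
    std::string workingDirectory;
};

// Hypothetical error for "the backend did not keep the original template".
struct DoesNotExist : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical job service; a real one would talk to the DRM system.
class JobService {
public:
    explicit JobService(bool keepTemplates) : keepTemplates_(keepTemplates) {}

    // Pretend to submit a job; cache the template only if the backend can.
    std::string submit(const JobTemplate& jt) {
        std::string id = "job-" + std::to_string(++counter_);
        if (keepTemplates_) {
            templates_[id] = jt;
        }
        return id;
    }

    // Return the template used at submission time, or throw DoesNotExist.
    JobTemplate getTemplate(const std::string& id) const {
        auto it = templates_.find(id);
        if (it == templates_.end()) {
            throw DoesNotExist("no stored template for " + id);
        }
        return it->second;
    }

private:
    bool keepTemplates_;
    int counter_ = 0;
    std::map<std::string, JobTemplate> templates_;
};

// Resubmit an earlier job with new arguments, leaving everything else fixed.
std::string resubmit(JobService& js, const std::string& id,
                     std::vector<std::string> newArgs) {
    JobTemplate jt = js.getTemplate(id);   // may throw DoesNotExist
    jt.arguments = std::move(newArgs);
    return js.submit(jt);
}
```

With JobService(true), resubmit() returns a new job id whose stored
template differs from the original only in its arguments. With
JobService(false), the same call throws DoesNotExist, which is the
behaviour suggested above for backends that cannot keep the template.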