[DRMAA-WG] Monitoring JobTemplate attributes for running jobs

Daniel Templeton daniel.templeton at oracle.com
Tue Aug 24 12:23:27 CDT 2010


How broadly applicable is it?  OGE supports it, and I think LSF does as 
well.  What about Condor and Torque/PBS?

Daniel

On 08/23/10 07:12 AM, Mariusz Mamoński wrote:
> Hi,
>
>   In some scenarios that rely on fault tolerance at the DRM level, a job
> must be marked as "rerunnable". Do we want to add this attribute to the
> DRMAAv2 JobTemplate?
>
> Cheers,
>
> On 23 August 2010 15:42, Daniel Templeton<daniel.templeton at oracle.com>  wrote:
>> I have a customer who has the resubmission of failed jobs in a greater
>> workflow as a critical requirement.  That's not actually something that
>> OGE itself supports, so I'm all for having it in DRMAA to plug the hole.
>>
>> Daniel
>>
>> On 08/23/10 02:50 AM, Peter Tröger wrote:
>>> We already have some understanding of persistency, so the implementation effort is manageable. I am more concerned about a clear separation of live monitoring information and original submission data. For the latter, I saw no use case so far ...
>>>
>>> Best,
>>> Peter.
>>>
>>> On 29.07.2010 at 11:02, Andre Merzky wrote:
>>>
>>>> Our use case for having access to the original complete job template
>>>> is that the user can easily resubmit the same job - just changing
>>>> for example some command line parameter, but leaving the remainder
>>>> fixed.   In SAGA this would look like:
>>>>
>>>>    saga::job::service     js ("drmaa://torque.remote.net/");
>>>>    saga::job::job         j1 = js.get_job (jobid);   // std::string
>>>>    saga::job::description jd = j1.get_description ();
>>>>
>>>>    jd.set_attributes ("Arguments", new_args);  // std::vector<std::string>
>>>>
>>>>    saga::job::job j2 = js.create_job (jd);
>>>>
>>>>
>>>> I understand that the backend may not be able to keep the original
>>>> job template - in that case, a 'DoesNotExist' exception on
>>>> 'get_description()' would be appropriate, IMHO.  If the DRMAA
>>>> implementation can cache that description somewhere, fine :-)
>>>>
>>>> My $0.02, Andre.
>>>>
>>>>
>>>> PS: saga::job::description == drmaa::job::template
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Quoting [Peter Tröger] (Jul 29 2010):
>>>>> From: Peter Tröger<peter at troeger.eu>
>>>>> Date: Thu, 29 Jul 2010 10:07:23 +0200
>>>>> To: Mariusz Mamoński<mamonski at man.poznan.pl>,
>>>>>      drmaa-wg at ogf.org
>>>>> Subject: Re: [DRMAA-WG] Monitoring JobTemplate attributes for running jobs
>>>>>
>>>>>
>>>>> On 28.07.2010 at 23:42, Mariusz Mamoński wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> 2010/7/28 Peter Tröger<peter at troeger.eu>:
>>>>>>> Hi,
>>>>>>> Agenda item #8 was not discussed in the call today, but it is the burning
>>>>>>> issue for me at the moment. Please have a look in the  "Attributes in
>>>>>>> JobInfo" tab:
>>>>>>> http://spreadsheets.google.com/ccc?key=0AqyvnBscJNqxcnJBSUs5dXRrU29EUVhGOGthc1lDTFE
>>>>>>> Currently, we allow access to the original JobTemplate from a JobInfo
>>>>>>> object. The idea was to get, besides the job monitoring information, also the
>>>>>>> information about what was demanded at submission time.
>>>>>>> While doing the Condor mapping, I figured out that most of the JobTemplate
>>>>>>> attributes are also monitorable for a running job. This includes things such
>>>>>>> as executable name and working directory. Normally they should be the same
>>>>>>> as in the JobTemplate, but Condor and SGE (at least) have this magic job
>>>>>>> wrapper stuff, where the admin can automatically and silently reconfigure /
>>>>>>> reinterpret everything in a JobTemplate. This might lead to the situation
>>>>>>> where the user asks for A, and silently gets B.
>>>>>>> The question: Should we drop the support for getting the JobTemplate as part
>>>>>>> of JobInfo, because the information is useless? Instead, we could add some
>>>>>>> (or maybe most) of the JobTemplate attributes as true dynamic monitoring
>>>>>>> information to JobInfo.
>>>>>> in my opinion repeating almost all attributes in this case brings
>>>>>> additional redundancy in the DRMAA API (another reason may be
>>>>>> performance - the JobTemplate attributes are more likely immutable).
>>>>>> Why not simply request expected behavior in the spec? e.g.:
>>>>>> a) the JobTemplate being part of the JobInfo struct is a reference to
>>>>>> the JobTemplate used for submission (for jobs submitted outside the
>>>>>> session it MUST be NULL)
>>>>>> b) the JobTemplate reflects actual attributes of a job (without
>>>>>> obligation that all attributes must be available - e.g. in Torque the
>>>>>> actually executed command is hidden in a script)
>>>>>
>>>>> The interesting thing is that we already started to do this replication, for example: JobTemplate::candidateMachines vs. JobInfo::allocatedMachines. I still vote for finishing this replication, and removing the JT reference from JobInfo as compensation. I also have a problem with fetching live data from a structure called "template".
>>>>>
>>>>> Your example from Torque underlines my argument - we should choose a monitorable subset of JobTemplate and add it to the JobInfo structure, instead of linking the JobTemplate directly.
>>>>>
>>>>> Any other opinions?
>>>>>
>>>>> Peter.
>>>> --
>>>> Nothing is ever easy.
>>>
>>> --
>>>     drmaa-wg mailing list
>>>     drmaa-wg at ogf.org
>>>     http://www.ogf.org/mailman/listinfo/drmaa-wg
>>
>
>
>
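
For illustration, the two ideas discussed in the quoted thread (resubmitting from the original template, and keeping monitored values separate from submission data) can be sketched in plain C++. This is only a sketch under assumptions: Session, JobTemplate, JobInfo, DoesNotExist, resubmitWithArgs and the "/opt/wrapper " prefix are made-up stand-ins, not the DRMAA or SAGA API, and the in-memory maps merely imitate a DRM system.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a JobTemplate: what the user asked for at submission.
struct JobTemplate {
    std::string remoteCommand;
    std::vector<std::string> args;
    std::string workingDirectory;
};

// Hypothetical stand-in for JobInfo: what the DRM system reports for the job.
struct JobInfo {
    std::string jobId;
    std::string remoteCommand;     // monitored value, may differ from the template
    std::string workingDirectory;  // monitored value
};

// Thrown when the backend did not keep the submission-time template.
struct DoesNotExist : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Toy in-memory session; a real implementation would talk to the DRM system.
class Session {
public:
    // keepTemplate = false models a backend that cannot persist templates.
    std::string runJob(const JobTemplate& jt, bool keepTemplate = true) {
        std::string id = "job-" + std::to_string(++counter_);
        // Model an admin-side job wrapper that silently rewrites the command.
        infos_[id] = JobInfo{id, wrapperPrefix_ + jt.remoteCommand, jt.workingDirectory};
        if (keepTemplate) {
            templates_[id] = jt;
        }
        return id;
    }

    // Live monitoring data; throws std::out_of_range for unknown job ids.
    const JobInfo& getJobInfo(const std::string& id) const {
        return infos_.at(id);
    }

    // Original submission data, if the backend kept it.
    const JobTemplate& getJobTemplate(const std::string& id) const {
        auto it = templates_.find(id);
        if (it == templates_.end()) {
            throw DoesNotExist("no stored template for " + id);
        }
        return it->second;
    }

private:
    const std::string wrapperPrefix_ = "/opt/wrapper ";  // assumed, for illustration
    int counter_ = 0;
    std::map<std::string, JobInfo> infos_;
    std::map<std::string, JobTemplate> templates_;
};

// Resubmit an existing job with new arguments, leaving everything else fixed.
// Propagates DoesNotExist if the original template is unavailable.
std::string resubmitWithArgs(Session& s, const std::string& jobId,
                             std::vector<std::string> newArgs) {
    JobTemplate jt = s.getJobTemplate(jobId);  // copy of the original request
    jt.args = std::move(newArgs);
    return s.runJob(jt);
}
```

In this sketch getJobTemplate() returns exactly what was submitted, while getJobInfo() returns what the (imaginary) wrapper actually ran, so the "user asks for A, and silently gets B" case stays visible instead of being folded into one structure.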

