[DRMAA-WG] DRMAA2

Andre Merzky andre at merzky.net
Fri Jan 14 16:34:27 CST 2011


Hi Peter,

in your proposal below, I am missing the waitAllStarted /
waitAllTerminated variants (which would return void, IMHO).  Otherwise it
looks great to me.  waitAll is easily implementable in the library
(max cost: 2n * waitAny).
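
A minimal sketch of such a library-level fallback, in Python-style
pseudocode (the job_id / wait_any_terminated spellings are illustrative,
not taken from the spec):

    def wait_all_terminated(job_array, timeout):
        # Naive fallback: keep asking for "any terminated job" until
        # every job of the array has been seen.  Worst case this takes
        # on the order of n calls to waitAnyTerminated, 2n if already
        # reported jobs can be returned again and must be skipped.
        remaining = {job.job_id for job in job_array.jobs}
        while remaining:
            job = job_array.wait_any_terminated(timeout)
            if job is None:
                raise TimeoutError("not all jobs terminated in time")
            remaining.discard(job.job_id)
        # returns nothing, matching the proposed void return type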

My $0.02,

  Andre.


On Fri, Jan 14, 2011 at 10:36 PM, Peter Tröger <peter at troeger.eu> wrote:
>
> ==== snip ===
> interface JobSession {
>     ...
>     Job runJob(in DRMAA::JobTemplate jobTemplate);
>     JobArray runBulkJobs(in DRMAA::JobTemplate jobTemplate,
>                          in long beginIndex, in long endIndex, in long step);
>     ...
> };
>
> interface JobArray {
>     readonly attribute string jobArrayId;
>     sequence<Job> jobs;
>     readonly attribute JobSession session;
>     readonly attribute JobTemplate jobTemplate;
>     readonly attribute Reservation reservation;
>     void suspend();    // suspend all jobs of the array; partial failures
>                        // in changing the state are ok
>     void resume();     // resume all jobs of the array; partial failures
>                        // in changing the state are ok
>     void hold();       // put a queued bulk job on hold
>     void release();    // release an array job on hold
>     void terminate();  // terminate the running jobs of the array
>     Job waitAnyStarted(in TimeAmount timeout);    // like the JobSession function
>     Job waitAnyTerminated(in TimeAmount timeout); // like the JobSession function
> };
> ==== snip ===
> Fetching status information only makes sense on the job level, so the
> corresponding getInfo() call is not part of the JobArray interface.
> I would also resist the temptation to add a JobArray counterpart of
> getJobs(JobInfo filter), since the filter semantics would become horrible
> to specify.
> All functions should be implementable with the 'loop' fallback in the
> library, as long as we allow partial success in the bulk control functions
> (see the sketch below).
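>
> To illustrate, a minimal Python-style sketch of that library-side loop
> fallback (the names are illustrative only, not part of the draft):
>
>     def suspend_all(jobs):
>         # Fallback for JobArray.suspend(): loop over the individual
>         # jobs and tolerate partial failures instead of aborting the
>         # whole bulk operation.
>         failures = []
>         for job in jobs:
>             try:
>                 job.suspend()
>             except Exception as exc:   # e.g. the job already finished
>                 failures.append((job, exc))
>         return failures                # empty list means full success
>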
> DRMAA folks, your comments please. Is this a feasible interface for the
> DRM systems in question that offer direct job array control support?
> Best,
> Peter.
>
> On 13.01.2011, at 09:23, Nadav Brandes wrote:
>
> The newer API specification does look a great deal better, and obviously
> some of the questions I came up with are no longer relevant.
>
> I'll let you decide what you think about the issues I mentioned that are
> still relevant, but first I want to elaborate a little on the job-arrays
> feature, which is the most crucial feature for us.
>
> When dealing with job arrays, each task actually has two IDs (the ID of
> the whole job-array, and the index of the task within the job-array).
> Therefore, with job-arrays, all of the queries and actions that the
> current DRMAA specification performs on jobs are actually performed on
> tasks, which are identified by two IDs instead of one and are, apart from
> that, just like single jobs.
>
> All I said so far doesn't make any significant difference and is only a
> matter of terminology. But the important thing about job-arrays is the
> ability to perform inclusive queries and operations on them.
> For example, one can terminate all of the tasks in a job-array with a
> single command, supplying only the ID of the whole job-array, without
> needing to give the ID of each task (which can be very tedious for
> users).
> An example of more advanced logic that one might want to perform on
> job-arrays is rerunning all the failed tasks in a given job-array.
> Another example is limiting the number of tasks that may run
> simultaneously in a job-array (for instance, submitting a job-array
> containing 1000 tasks, where only 10 tasks are allowed to run at any
> given time).
> The greatest advantage of job-arrays is the ability of users to "remember"
> many tasks under a single ID, which is impossible when submitting many
> single jobs.
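>
> In Python-style pseudo-code (all names are purely illustrative), the kind
> of array-level logic we have in mind could look like this:
>
>     def rerun_failed_tasks(job_array):
>         # Walk over the tasks of one array, each addressed by
>         # (array id, task index), and resubmit only the failed ones.
>         for index, task in enumerate(job_array.tasks):
>             if task.exit_status != 0:
>                 print("rerunning task", job_array.array_id, index)
>                 task.rerun()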
>
> Many schedulers (like LSF) support all of these features, and you can see
> them implemented in a growing number of schedulers.
>
> We believe that DRMAA should support these features as well, by being more
> "job-array oriented". I truly believe that DRMAA will be better if it
> supports job-arrays.
>
> 2011/1/12 Mariusz Mamoński <mamonski at man.poznan.pl>
>>
>> Hi Nadav,
>>
>> On 12 January 2011 17:03, Nadav Brandes <nadavbrandes at gmail.com> wrote:
>> > Hello everyone,
>> >
>> > I went over your API description with my team (as described in
>> > http://www.drmaa.org/drmaav2_draft5.pdf).
>> please use the wiki, as it holds the most up-to-date version of the DRMAA spec:
>> http://wikis.sun.com/display/DRMAAv2/Home
>> >
>> > If it's not too late, we have a few questions/suggestions:
>> >
>> > · Can one get a 'Job' object representing a job that was already
>> > submitted, given only the job index (as an integer)?
>> It is supported: the JobSession has a method
>>     sequence<Job> getJobs(JobInfo filter);
>> which, as I remember, is not constrained to jobs submitted via DRMAA.
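>>
>> A quick Python-flavoured sketch of how a client could recover a Job
>> handle that way (the spelling of JobInfo and get_jobs here is only
>> illustrative):
>>
>>     def find_job(session, job_id):
>>         # Build a filter that matches a single job id and query the
>>         # session; return None if the DRM system does not know the job.
>>         job_filter = JobInfo(job_id=job_id)
>>         matches = session.get_jobs(job_filter)
>>         return matches[0] if matches else None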
>> >
>> > · It seems like the 'JobInfo' interface misses a few parameters given
>> > in the 'JobTemplate' interface. For example, can one get the
>> > 'remoteCommand' of a job that was already submitted, if one only has a
>> > 'Job' object in hand, and not the 'JobTemplate'?
>> >
>> > · Does DRMAA support a job-arrays feature (meaning submitting a group
>> > of tasks as one job that has a single ID)? Most schedulers support
>> > this feature (including LSF, Moab and SGE). You do have a
>> > 'runBulkJobs' feature that submits a sequence of jobs at once, but it
>> > also returns a sequence of 'Job' objects, and not a single job with a
>> > single ID.
>> IMHO most of the batch systems return many job ids for job arrays, but
>> they offer to perform some of the operations on the whole array (bulk)
>> by giving the common suffix of those job ids. Having one job id, and
>> thus one Job, complicates the state model (what if half of the array
>> sub-jobs are running and the rest are queued? What should the state of
>> the whole array job be?)
>> >
>> > · Does DRMAA support the notion of queues (a feature that is supported
>> > by all of the schedulers I know)? We believe that it could be very
>> > useful if one could specify a queue in the 'JobTemplate', change the
>> > queue of an existing job, and also get a list of all the queues in the
>> > cluster.
>> this was already addressed (wiki!), except for the alteration of the
>> target queue of an already submitted job.
>> >
>> > · Many batch systems have a feature that allows giving a 'project
>> > name' to submitted jobs. We believe that it could also be very useful
>> > if the 'JobTemplate' had such a field.
>> we have that: it is called accountingId
>> >
>> > · Sometimes, especially when dealing with large clusters containing a
>> > large number of compute nodes (some of which might be out of order),
>> > jobs might fail randomly, without any justified reason. We think it
>> > could be useful if DRMAA supported a feature that allows rerunning
>> > failed jobs (as many schedulers do, LSF for example). Such a 'rerun()'
>> > method could be added to the 'Job' interface.
>> We have that: the rerunnable attribute of the JobTemplate. So one can
>> configure the batch system to rerun jobs that failed due to a resource
>> failure.
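>>
>> Roughly, in Python-style pseudocode (the spelling is illustrative; only
>> the rerunnable attribute and runJob come from the spec discussion):
>>
>>     jt = session.create_job_template()         # illustrative helper
>>     jt.remote_command = "/bin/my_simulation"   # hypothetical command
>>     jt.rerunnable = True    # let the DRM system rerun the job after a
>>                             # resource failure
>>     job = session.run_job(jt)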
>> >
>> > · Modern schedulers (like Moab and LSF) support advanced features for
>> > memory management, core management, and also general resource
>> > management (like GPUs). In general, this means giving a list of
>> > required resources with each submitted job (for example, submitting a
>> > job that requires 5 cores, 12 GB RAM, and 2 GPUs). The scheduler then
>> > knows how to schedule the jobs so that each running job has all the
>> > resources it needs. If the 'JobTemplate' had a resources dictionary
>> > field, it could also be very useful.
>> resources that are common to all schedulers are expressed as JobTemplate
>> attributes, e.g. minPhysMemory; other, DRMS-specific options (including
>> resource requirements) should go into:
>>     attribute Dictionary drmsSpecific;   // must be supported
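>>
>> As a rough Python-style illustration of that split (the attribute
>> spellings are only indicative; the core/GPU keys are made up):
>>
>>     jt = session.create_job_template()        # illustrative helper
>>     jt.remote_command = "/bin/my_simulation"
>>     jt.min_phys_memory = 12 * 1024 ** 3       # 12 GB, portable attribute
>>     # anything the spec does not cover goes into the DRMS-specific dict
>>     jt.drms_specific = {"cores": "5", "gpus": "2"}
>>     job = session.run_job(jt)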
>>
>> >
>> > This is it for now, thanks for reading it.
>> thanks for providing your comments, and sorry that you lost so much time
>> reading a very old version of the specification (@Peter: maybe it would
>> be better to delete the reference to the September 2009 DRMAA2 Draft 5)
>> >
>> > I would like to hear what you think.
>> >
>> > Best Regards,
>> >
>> > Nadav
>> >
>> > 2010/12/21 Peter Tröger <peter at troeger.eu>
>> >>
>> >> Hi Nadav,
>> >>
>> >> I have now seen the documentation of the planned interface for DRMAA2,
>> >> and I find it to be a great improvement, and very useful for my
>> >> organization. I am truly eager to try it, and have some more questions
>> >> about its release:
>> >>
>> >> Do you know which distributed resource manager will be the first to
>> >> implement DRMAA2? (SGE maybe?) Also, do you have any estimate of when
>> >> that will happen, and when I will be able to download a trial version?
>> >>
>> >> Since we have the Oracle Grid Engine Product Manager as one of the
>> >> co-chairs, I leave the implementation estimate to you ;-) ... We also
>> >> have very capable people in Poznan who might take care of non-OGE
>> >> systems. We expect to put out the spec in January, and from there, the
>> >> group can only hope. From experience, I would expect nothing useful
>> >> before Summer 2011.
>> >>
>> >> Is it still possible to suggest ideas that we have about the DRMAA2
>> >> interface? If so, how is it done? Is it customary to share ideas in
>> >> this forum, or do you prefer it to be done through the Wiki?
>> >>
>> >> The best thing is to start a discussion on the list. The Wiki is good
>> >> as a reference. Comments on the Wiki pages might get lost ...
>> >> Best regards,
>> >> Peter.
>> >
>> >
>> > --
>> >  drmaa-wg mailing list
>> >  drmaa-wg at ogf.org
>> >  http://www.ogf.org/mailman/listinfo/drmaa-wg
>> >
>>
>>
>> Best Regards,
>> --
>> Mariusz
>
> --
>  drmaa-wg mailing list
>  drmaa-wg at ogf.org
>  http://www.ogf.org/mailman/listinfo/drmaa-wg
>



-- 
Nothing is ever easy...

