[DRMAA-WG] DRMAA2

Nadav Brandes nadavbrandes at gmail.com
Fri Jan 21 04:12:40 CST 2011


Thank you all for your comments.

And thank you, Peter, for your comment and draft. I really like it, and it
looks great.



Only a few things I would change:



(1)    I agree that an option to filter jobs with an arbitrary user-defined
filter would be pretty horrible, and that it is better to just let the user
iterate over the jobs and filter them as they like. But I would add a
feature that allows filtering jobs by state (as many batch systems
support). Something like this:



interface JobArray {
                …
                sequence<Job> getJobsOfState(in JobState state)
}
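
To make the intent concrete, here is a minimal Python sketch of the library-side "loop" fallback for such a call. It is only an illustration under stated assumptions: the class names, the short list of states, and the LSF-style task IDs are all hypothetical and are not taken from the DRMAA2 draft.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class JobState(Enum):
    # Hypothetical subset of states, chosen only for this illustration.
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    DONE = "DONE"
    FAILED = "FAILED"


@dataclass
class Job:
    job_id: str
    state: JobState


@dataclass
class JobArray:
    job_array_id: str
    jobs: List[Job] = field(default_factory=list)

    def get_jobs_of_state(self, state: JobState) -> List[Job]:
        # Fallback implementation: scan the tasks one by one.
        # A DRM system with native state filtering could instead
        # answer this with a single query.
        return [job for job in self.jobs if job.state == state]


# Hypothetical array with four tasks, two of which failed.
array = JobArray("1234", [
    Job("1234[1]", JobState.RUNNING),
    Job("1234[2]", JobState.FAILED),
    Job("1234[3]", JobState.QUEUED),
    Job("1234[4]", JobState.FAILED),
])
failed_ids = [job.job_id for job in array.get_jobs_of_state(JobState.FAILED)]
print(failed_ids)
```

With something like this, "rerun all failed tasks" becomes one filter call followed by a loop over the result.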





(2)    Also, I think that DRMAA should allow job-arrays to take more
arguments than regular jobs can (in the JobTemplate struct). For
example, as I mentioned before, you might want to give a 'slotsLimit'
argument to a newly submitted job-array (in order to limit the number of
tasks in the job-array that may run simultaneously). Therefore, I would
change the interface to something like this:



struct JobArrayTemplate extends JobTemplate {
                // Contains all the attributes that JobTemplate contains
                // Also contains the following attributes:
                attribute long beginIndex
                attribute long endIndex
                attribute long step
                // Limits the number of tasks in the job-array that may run
                // at any one time
                attribute long slotsLimit
                // I guess that more attributes will be added here over time
}



interface JobSession {
                …
                JobArray runBulkJobs(in DRMAA::JobArrayTemplate jobArrayTemplate)
}
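
As a rough illustration of how such a template could look in a language binding, here is a small Python sketch. The field defaults, the inclusive end index, the convention that a slots limit of 0 means "no limit", and the two helper methods are assumptions made for this example only; none of them come from the draft.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class JobTemplate:
    # Only two JobTemplate attributes are shown, for brevity.
    remote_command: str = ""
    args: List[str] = field(default_factory=list)


@dataclass
class JobArrayTemplate(JobTemplate):
    begin_index: int = 1
    end_index: int = 1
    step: int = 1
    slots_limit: int = 0  # assumption: 0 means "no limit"

    def task_indices(self) -> List[int]:
        # Assumption: end_index is inclusive.
        return list(range(self.begin_index, self.end_index + 1, self.step))

    def max_concurrent(self) -> int:
        # Upper bound on the number of tasks running at the same time.
        total = len(self.task_indices())
        if self.slots_limit <= 0:
            return total
        return min(total, self.slots_limit)


# 1000 tasks, of which at most 10 may run at any one time.
template = JobArrayTemplate(
    remote_command="/bin/sleep",
    args=["60"],
    begin_index=1,
    end_index=1000,
    step=1,
    slots_limit=10,
)
print(len(template.task_indices()), template.max_concurrent())
```

Putting the index range and the slots limit in the template keeps the runBulkJobs signature stable when further job-array attributes are added later.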





By the way, you can see that all the features that I mentioned here are
supported by LSF:

http://www-cecpv.u-strasbg.fr/Documentations/lsf/html/lsf6.1_admin/G_jobarrays.html#26618



Best Regards,

Nadav



On Sat, Jan 15, 2011 at 12:34 AM, Andre Merzky <andre at merzky.net> wrote:

> Hi Peter,
>
> in your proposal below, I am missing the waitAllStarted /
> waitAllTerminated versions (which would return void IMHO).  Otherwise
> looks great to me.  waitAll is easily implementable in the library
> (max cost: 2n*waitAny).
>
> My $0.02,
>
>  Andre.
>
>
> On Fri, Jan 14, 2011 at 10:36 PM, Peter Tröger <peter at troeger.eu> wrote:
> >
> > ==== snip ===
> > interface JobSession {
> > ...
> > Job runJob(in DRMAA::JobTemplate jobTemplate)
> > JobArray runBulkJobs(in DRMAA::JobTemplate jobTemplate, in long beginIndex,
> >                      in long endIndex, in long step)
> >                 ...
> > }
> > interface JobArray {
> > readonly attribute string jobArrayId;
> > sequence<Job> jobs;
> > readonly attribute JobSession session;
> > readonly attribute JobTemplate jobTemplate;
> > readonly attribute Reservation reservation;
> > void suspend()     // suspend all jobs of the array, partial failures in changing the state are ok
> > void resume()      // resume all jobs of the array, partial failures in changing the state are ok
> > void hold()        // put a queued bulk job on hold
> > void release()     // release an array job on hold
> > void terminate()   // terminate a running job
> > Job waitAnyStarted(in TimeAmount timeout)    // similar to JobSession function
> > Job waitAnyTerminated(in TimeAmount timeout) // similar to JobSession function
> > };
> > ==== snip ===
> > Fetching status information only makes sense at the job level, so the
> > corresponding getInfo() call is not part of the JobArray interface.
> > I would also resist the temptation to add a JobArray counterpart of
> > getJobs(JobInfo filter), since the filter semantics would become horrible
> > to specify.
> > All functions should be implementable with the 'loop' fallback in the
> > library, when we allow partial success in the bulk control functions.
> > DRMAA folks, your comments please. Is this a feasible interface for the
> > denoted DRM systems with direct job array control support ?
> > Best,
> > Peter.
> >
> >
> >
> >
> > Am 13.01.2011 um 09:23 schrieb Nadav Brandes:
> >
> > The newer API specification does look a great deal better, and obviously
> > I came up with some irrelevant questions.
> >
> > I'll let you decide what you think about those issues I mentioned that
> > are still relevant, but first I want to elaborate a little bit about
> > the job-arrays feature, which is the most crucial feature for us.
> >
> > When dealing with job arrays, each task actually has two IDs (the ID of
> > the whole job-array, and the index of the task within the job-array).
> > Therefore, in job-arrays, all of the queries and actions that the
> > current DRMAA specification performs on jobs are actually performed
> > on tasks, which are identified by two IDs instead of one and are
> > otherwise exactly like single jobs.
> >
> > All I said so far doesn't make any significant difference, and is only a
> > matter of terminology. But the important thing about job-arrays is the
> > ability to perform inclusive queries and operations on them.
> > For example, one can terminate all of the tasks in a job-array using a
> > single command (supplying only the ID of the whole job-array, without
> > needing to give the ID of each task, which might be very exhausting for
> > users).
> > An example of more advanced logic that one might want to perform on
> > job-arrays is rerunning all the failed tasks in a given job-array.
> > Another is limiting the number of tasks that may run simultaneously in
> > a job-array (for example, submitting a job-array containing 1000 tasks,
> > where only 10 tasks are allowed to run at any given time).
> > The greatest advantage of job-arrays is that users can "remember" many
> > tasks with a single ID, which is impossible when submitting many
> > single jobs.
> >
> > Many schedulers (like LSF) support all these features, and you can see
> > them implemented in a growing number of schedulers.
> >
> > We believe that DRMAA should support these features as well, by being
> > more "job-arrays oriented". I truly believe that DRMAA will be better if
> > it supports job-arrays.
> >
> > 2011/1/12 Mariusz Mamoński <mamonski at man.poznan.pl>
> >>
> >> Hi Nadav,
> >>
> >> On 12 January 2011 17:03, Nadav Brandes <nadavbrandes at gmail.com> wrote:
> >> > Hello everyone,
> >> >
> >> > I went over your API description with my team (as described in
> >> > http://www.drmaa.org/drmaav2_draft5.pdf).
> >> please use the wiki, as it has the most up-to-date version of the DRMAA
> >> spec: http://wikis.sun.com/display/DRMAAv2/Home
> >> >
> >> >
> >> >
> >> > If it's not too late, we have a few questions/suggestions:
> >> >
> >> > ·         Can one get a 'Job' object representing an already
> >> > submitted job, given only the job index (as an integer)?
> >> It is supported: the JobSession has a method:
> >>                sequence<Job> getJobs(JobInfo filter);
> >> which, as I remember, is not constrained to jobs submitted via DRMAA.
> >> >
> >> > ·         It seems like the 'JobInfo' interface is missing a few
> >> > parameters given in the 'JobTemplate' interface. For example, can one
> >> > get the 'remoteCommand' of a job that was already submitted, if one
> >> > only has a 'Job' object in hand, and not the 'JobTemplate'?
> >> >
> >> > ·         Does DRMAA support the job-arrays feature (meaning submitting
> >> > a group of tasks in one job that has a single ID)? Most schedulers
> >> > support this feature (including LSF, Moab and SGE). You do have a
> >> > 'runBulkJobs' feature that sends a sequence of jobs all together, but
> >> > it also returns a sequence of 'Job' objects, and not a single job with
> >> > a single ID.
> >> IMHO most batch systems return many job IDs for job arrays, but
> >> they allow some operations to be performed on the whole array
> >> (in bulk) by giving the common suffix of those job IDs. Having one job
> >> ID, and thus one Job, complicates the state model (what if half of the
> >> array's sub-jobs are running and the rest are queued? What should be the
> >> state of the whole array job?)
> >> >
> >> > ·         Does DRMAA support the notion of queues (a feature that is
> >> > supported by all of the schedulers I know)? We believe that it could
> >> > be very useful if one could specify a queue in 'JobTemplate', change
> >> > the queue of an existing job, and also get a list of all the queues in
> >> > the cluster.
> >> this was already addressed (wiki!), except alteration of target queue
> >> of already submitted job.
> >> >
> >> > ·         Many batch systems have a feature that allows giving a
> >> > 'project name' to submitted jobs. We believe that it could also be
> >> > very useful if 'JobTemplate' had such a field.
> >> has: it is called accountingId
> >> >
> >> > ·         Sometimes, especially when dealing with large clusters
> >> > containing a large number of compute nodes (some of which might be
> >> > out of order), jobs might fail randomly, without any justified reason.
> >> > We think it could be useful if DRMAA supported a feature that allows
> >> > rerunning failed jobs (as many schedulers, like LSF, allow). Such a
> >> > 'rerun()' method could be added to the 'Job' interface.
> >> We have the rerunnable attribute of the JobTemplate, so one can configure
> >> the batch system to rerun jobs that failed due to resource failure.
> >> >
> >> > ·         Modern schedulers (like Moab and LSF) support advanced
> >> > features for memory management, core management, and general resource
> >> > management (like GPUs). In general, this means giving a list of
> >> > required resources to each submitted job (for example, submitting a
> >> > job that requires 5 cores, 12GB RAM, and 2 GPUs). The scheduler then
> >> > knows how to schedule the jobs so that each running job has all the
> >> > resources it needs. If 'JobTemplate' had a resources dictionary field,
> >> > it could also be very useful.
> >> resources that are common to all schedulers are expressed as
> >> JobTemplate attributes, e.g.: minPhysMemory
> >> other DRMS-specific options (including resource requirements)
> >>  should go to:          attribute Dictionary drmsSpecific;   // must be supported
> >>
> >> >
> >> >
> >> >
> >> > This is it for now, thanks for reading it.
> >> thanks for providing your comments, and sorry that you lost so much
> >> time reading a very old version of the specification (@Peter: maybe
> >> it would be better to delete the reference to the September 2009 DRMAA2
> >> Draft 5)
> >> >
> >> > I would like to hear what you think.
> >> >
> >> >
> >> >
> >> > Best Regards,
> >> >
> >> > Nadav
> >> >
> >> > 2010/12/21 Peter Tröger <peter at troeger.eu>
> >> >>
> >> >> Hi Nadav,
> >> >>
> >> >> Now I saw the documentation of the planned interface for DRMAA2, and
> >> >> I find it to be a great improvement, and very useful for my
> >> >> organization. I am truly anxious to try it, and have some more
> >> >> questions about its release:
> >> >>
> >> >> Do you know which distributed resource manager will be the first to
> >> >> implement DRMAA2? (SGE maybe?) Also, do you have any estimate of
> >> >> when it'll happen, and when I will be able to download a trial
> >> >> version of it?
> >> >>
> >> >> Since we have the Oracle Grid Engine Product Manager as one of the
> >> >> co-chairs, I leave the implementation estimate to you ;-) .... We
> >> >> also have very capable people in Poznan, who might take care of
> >> >> non-OGE systems. We expect to put out the spec in January, and from
> >> >> there, the group can only hope. From experience, I would expect
> >> >> nothing useful before Summer 2011.
> >> >>
> >> >> Is it still possible to suggest ideas that we have about the
> >> >> interface of DRMAA2? If so, how is it done? Is it customary to share
> >> >> ideas in this forum, or do you prefer it to be done through the Wiki?
> >> >>
> >> >> The best thing is to start a discussion on the list. The Wiki is good
> >> >> as a reference. Comments on the Wiki pages might get lost ...
> >> >> Best regards,
> >> >> Peter.
> >> >
> >> >
> >> > --
> >> >  drmaa-wg mailing list
> >> >  drmaa-wg at ogf.org
> >> >  http://www.ogf.org/mailman/listinfo/drmaa-wg
> >> >
> >>
> >>
> >> Best Regards,
> >> --
> >> Mariusz
> >
> > --
> >  drmaa-wg mailing list
> >  drmaa-wg at ogf.org
> >  http://www.ogf.org/mailman/listinfo/drmaa-wg
> >
>
>
>
> --
> Nothing is ever easy...
>