[DRMAA-WG] DRMAA2 comments

Nadav Brandes nadavbrandes at gmail.com
Mon May 2 12:53:56 CDT 2011


Thanks for your quick response. Here are my replies:

   1. Awesome :)
   2. As I understood from the spec, *JobTemplate::maxSlots* relates to the
   number of cores requested on one machine for a single job, and not to the
   number of jobs that may run at the same time. Am I wrong?
   3. Fair enough. Let's hope that more people will complain about it ;-)
   4. On big clusters running complicated job distributions, jobs often
   fail arbitrarily for temporary reasons such as network problems. For
   example, if one submits 1,000 jobs, 10 of them might fail at random and
   have to be rerun before the whole job array finishes successfully. If
   DRMAA had rerun functionality, the user could do something like this
   (the example is in Java):

       for (Job job : myJobArray.jobs) {
           if (job.getState() == JobState.FAILED) {
               // Will change the job state back to QUEUED, and later on to
               // RUNNING (the job will run again from the beginning)
               job.rerun();
           }
       }

   5. Sounds great.
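
To make the state flow proposed in item 4 concrete, here is a minimal self-contained sketch. It is not the DRMAA2 API: the spec has no Job::rerun(), and the RerunSketch, Job and JobState types below, along with the attempt counter, are hypothetical stand-ins that only model the suggested FAILED -> QUEUED transition.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not the DRMAA2 API: DRMAA2 defines no Job::rerun().
class RerunSketch {

    enum JobState { QUEUED, RUNNING, DONE, FAILED }

    static class Job {
        private final String id;
        private JobState state;
        private int attempts = 1;

        Job(String id, JobState state) {
            this.id = id;
            this.state = state;
        }

        String getId() { return id; }
        JobState getState() { return state; }
        int getAttempts() { return attempts; }

        // Assumed semantics: only a FAILED job may be rerun; it goes back to QUEUED.
        void rerun() {
            if (state != JobState.FAILED) {
                throw new IllegalStateException("job " + id + " is " + state + ", not FAILED");
            }
            state = JobState.QUEUED;
            attempts++;
        }
    }

    // Reruns every FAILED job and returns how many were requeued.
    static int rerunFailed(List<Job> jobs) {
        int requeued = 0;
        for (Job job : jobs) {
            if (job.getState() == JobState.FAILED) {
                job.rerun();
                requeued++;
            }
        }
        return requeued;
    }

    public static void main(String[] args) {
        List<Job> jobs = new ArrayList<>();
        jobs.add(new Job("array.1", JobState.DONE));
        jobs.add(new Job("array.2", JobState.FAILED));
        jobs.add(new Job("array.3", JobState.RUNNING));

        int requeued = rerunFailed(jobs);
        System.out.println("requeued " + requeued + " job(s)");
        for (Job job : jobs) {
            System.out.println(job.getId() + " -> " + job.getState());
        }
    }
}
```

In this sketch the job keeps its identity across the rerun, so anything holding the Job object sees the same job move back to QUEUED rather than a new job appearing. Whether a real DRM system would keep or replace the job ID is left open here.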


Regards,
Nadav

2011/4/29 Peter Tröger <peter at troeger.eu>

> Hi Nadav,
>
> thanks (again) for your in-depth analysis. Here are my comments.
>
>
>    1. Given a *jobId*, you can easily get its *Job* object using the
>    method *JobSession::getJobs(in JobInfo filter)*, if you pass as a
>    filter a *JobInfo* with the wanted *jobId* (maybe a method
>    *JobSession::getJob(string jobId)* would be an easier shorthand, but
>    this is a different issue). *But*, given a *jobArrayId*, there is no
>    way to get its *JobArray* object. This is a major limitation of DRMAA:
>    it doesn't really let users use the *JobArray* feature the way it
>    is used in most batch systems. I think a similar method
>    *JobSession::getJobArrays(in JobArrayInfo filter)* should be added, or
>    at least a method *JobSession::getJobArray(string jobArrayId)*.
>
> Symmetry is always good, I see no problem with adding
> "JobSession::getJobArrays(in JobArrayInfo filter)".
>
>
>    2. A very important feature that many batch systems support is the
>    ability to limit the number of jobs in a job array that may run
>    simultaneously (in LSF it's called "Slot Limit" and you can read about it at
>    http://www-cecpv.u-strasbg.fr/Documentations/lsf/html/lsf6.1_admin/G_jobarrays.html#26618).
>    I think that DRMAA can also support this feature by:
>       1. Changing the method *JobSession::runBulkJobs* so that it also
>       accepts an optional argument *in long slotLimit* (if it's *UNSET*,
>       then no slot limit will be assigned to the new job array).
>       2. Adding a new method *JobArray::changeSlotLimit(in long slotLimit)*
>
> This is what JobTemplate::maxSlots is expected to provide.
>
>
>    3. There are some parameters that most batch systems allow changing for
>    already submitted jobs, but DRMAA doesn't support changing them. For
>    example, DRMAA doesn't let you change the priority or queue of an already
>    submitted job. I think that methods *Job::changePriority(in long
>    priority)* and *Job::changeQueue(in string queueName)* should be added.
>
> We discussed the general possibility of changing the attributes of running
> jobs. There are tons of issues with making such a concept available in a
> generalized API. One reason is that the DRM system may silently change
> attributes at queuing time - Grid Engine is one example. In such a case, you
> cannot know what kind of job attribute state you are actually changing. So
> you need better monitoring. And so on ... The possibilities and supported
> attributes for online changes also vary widely between the different systems.
> For this reason, DRMAA intentionally leaves out the complete idea - at
> least until enough people complain ;-)
>
>
>    4. Many batch systems allow rerunning existing jobs. Although DRMAA has
>    a field called *rerunnable* in the *JobTemplate* struct, it doesn't
>    allow users to actually rerun jobs. Maybe a method *Job::rerun()* could
>    be added to DRMAA.
>
> The rerunnable flag is intended to allow the DRM system itself to re-run a
> job. We never had a proposal for such functionality from a user perspective.
> What would be the expected job state flow in this case? And what is the use
> case for such functionality, if you don't have interactive job support?
>
>
>    5. I have a question. Does DRMAA support Generic Resources? For
>    example, if I have a cluster where some of the nodes have GPU cards, and I
>    want to submit jobs that require a certain number of GPUs, I would like
>    the batch system to manage it for me, as many batch systems know how to do.
>
> Requesting non-standardized resource types and configurations is expected
> to be covered by the "jobCategory" concept. Examples of job categories are
> different MPI libraries, OpenMP environments, Java environments, or GPU
> environments. We hope to organize a community-based list of recommended job
> category names, which would raise the chances of portability for such job
> submission applications. A later DRMAA2 version could then integrate these
> names as an official part of the spec.
>
> Best regards,
> Peter.
>
>