[DRMAA-WG] DRMAA2 comments

Mariusz Mamoński mamonski at man.poznan.pl
Wed May 4 14:07:01 CDT 2011


2011/5/2 Nadav Brandes <nadavbrandes at gmail.com>:
> Thanks for your quick response. Here is my response:
>
> Awesome :)
> As I understood from the spec, JobTemplate::maxSlots relates to the number
> of cores requested on one machine for a single job, and not to the number of
> jobs that may run at the same time. Am I wrong?

No, you are right (except for the "one machine" part, and slots may not
always mean cores).

> Fair enough. Let's hope that more people will complain about it ;-)
> When working with big clusters and distribution of complicated jobs, there
> are often cases when jobs might arbitrarily fail, for any temporary reason
> such as network problems. For example, if one submits 1,000 jobs, then 10 of
> them might just randomly fail, and have to be rerun in order for the whole
> job array to finish successfully. If DRMAA had a rerun functionality,
> then the user could do something like this: (The example is in Java)
>
> for (Job job : myJobArray.jobs) {
>     if (job.getState() == JobState.FAILED) {
>         // Will change the job state back to QUEUED, and later on to
>         // RUNNING (the job will run again from the beginning)
>         job.rerun();
>     }
> }

This would require transitions from FAILED to other states -> FAILED
is no longer terminal -> avalanche... ;-)

But the DRMS can be configured to do that automagically on behalf of the user.
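
A rough sketch of what a client can do without a rerun() call, assuming it
still knows what it submitted: leave the failed jobs in FAILED and submit each
one again as a new job. Everything here (Session, Job, JobState, runJob,
resubmitFailed) is a hypothetical stand-in written for this mail, not the
DRMAA2 Java binding, and the "command" string merely stands in for a job
template.

```java
import java.util.ArrayList;
import java.util.List;

public class ResubmitFailed {
    // Hypothetical, simplified job states; FAILED stays terminal.
    enum JobState { QUEUED, RUNNING, DONE, FAILED }

    // Hypothetical job handle; "command" stands in for the job template.
    static final class Job {
        final String id;
        final String command;
        JobState state;

        Job(String id, String command, JobState state) {
            this.id = id;
            this.command = command;
            this.state = state;
        }
    }

    // Hypothetical session: every submission creates a NEW job in QUEUED.
    static final class Session {
        private int nextId = 1;

        Job runJob(String command) {
            return new Job("job-" + nextId++, command, JobState.QUEUED);
        }
    }

    // Submits a fresh job for every FAILED one; the failed jobs are not touched.
    static List<Job> resubmitFailed(Session session, List<Job> jobs) {
        List<Job> retries = new ArrayList<>();
        for (Job job : jobs) {
            if (job.state == JobState.FAILED) {
                retries.add(session.runJob(job.command));
            }
        }
        return retries;
    }

    public static void main(String[] args) {
        Session session = new Session();
        List<Job> jobs = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            jobs.add(session.runJob("task " + i));
        }
        // Pretend two of the jobs failed for some temporary reason.
        jobs.get(1).state = JobState.FAILED;
        jobs.get(3).state = JobState.FAILED;

        for (Job retry : resubmitFailed(session, jobs)) {
            System.out.println(retry.id + " " + retry.command + " " + retry.state);
        }
    }
}
```

The retries get new job ids, so the application has to keep its own mapping
from failed job to retry; that is the price for keeping FAILED terminal.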
>
> Sounds great.
>
> Regards,
> Nadav
>
> 2011/4/29 Peter Tröger <peter at troeger.eu>
>>
>> Hi Nadav,
>> thanks (again) for your in-depth analysis. Here are my comments.
>>
>> Given a jobId, you can easily get its Job object using the method
>> JobSession::getJobs(in JobInfo filter), if you give as a filter a
>> JobInfo with the wanted jobId (maybe it would be an easier shorthand if
>> DRMAA had a method JobSession::getJob(string jobId), but this is a different
>> issue). But, given a jobArrayId, there is no way to get its JobArray object,
>> which is a major limitation of DRMAA that doesn't really let users use the
>> JobArray feature in DRMAA as it is used in most batch systems. I think that
>> a similar method JobSession::getJobArrays(in JobArrayInfo filter) should be
>> added, or at least a method JobSession::getJobArray(string jobArrayId).
>>
>> Symmetry is always good, I see no problem with adding
>> "JobSession::getJobArrays(in JobArrayInfo filter)".
>>
>> A very important feature that many batch systems support is the ability to
>> limit the number of jobs in a job array that may run simultaneously (in LSF
>> it's called "Slot Limit" and you can read about it at
>> http://www-cecpv.u-strasbg.fr/Documentations/lsf/html/lsf6.1_admin/G_jobarrays.html#26618).
>> I think that DRMAA can also support this feature by:
>>
>> - Change the method JobSession::runBulkJobs so that it also accepts an
>>   optional argument "in long slotLimit" (if it's UNSET, then no slot limit
>>   will be assigned to the new job array).
>> - Add a new method JobArray::changeSlotLimit(in long slotLimit).
>>
>> This is what JobTemplate::maxSlots is expected to provide.
>>
>> There are some parameters that most batch systems allow changing for
>> already submitted jobs, but DRMAA doesn't support changing them. For
>> example, DRMAA doesn't let you change the priority or queue of an already
>> submitted job. I think that methods Job::changePriority(in long priority)
>> and Job::changeQueue(in string queueName) should be added.
>>
>> We discussed the general possibility of changing the attributes of running
>> jobs. There are tons of issues with making such a concept available in a
>> generalized API. One reason is hidden changes of attributes by the DRM
>> system at queuing time - Grid Engine is one example. In such a case, you
>> cannot know what kind of job attribute state you are actually changing. So
>> you need better monitoring. And so on ... The possibilities and supported
>> attributes for online changes also vary widely in the different systems.
>> For this reason, DRMAA intentionally leaves out the complete idea - at
>> least until enough people complain ;-)
>>
>> Many batch systems allow rerunning existing jobs. Although DRMAA has a
>> field called rerunnable in the JobTemplate struct, it doesn't allow users to
>> actually rerun jobs. Maybe a method Job::rerun() could be added to DRMAA.
>>
>> The rerunnable flag is intended to allow the DRM system itself to re-run
>> a job. We never had a proposal for such functionality from a user
>> perspective. What would be the expected job state flow in this case? And
>> what is the use case for such functionality, if you don't have
>> interactive job support?
>>
>> I have a question. Does DRMAA support Generic Resources? (for example, if
>> I have a cluster where some of its nodes have GPU cards, and I want to
>> submit jobs that require a certain amount of GPUs, so I would like the batch
>> system to manage it for me, as many batch systems know how to manage).
>>
>> Requesting non-standardized resource types and configurations is expected
>> to be covered by the "jobCategory" concept. Examples of job categories are
>> different MPI libraries, OpenMP environments, Java environments, or GPU
>> environments. We hope to organize a community-based list of recommended job
>> category names, which would raise the chances of portability for such job
>> submission applications. A later DRMAA2 version could then integrate these
>> names as an official part of the spec.
>>
>> Best regards,
>> Peter.
>
>
> --
>  drmaa-wg mailing list
>  drmaa-wg at ogf.org
>  http://www.ogf.org/mailman/listinfo/drmaa-wg
>



-- 
Mariusz

