[DRMAA-WG] Load average interval ?

Daniel Templeton daniel.templeton at oracle.com
Fri Mar 26 14:32:04 CDT 2010


I meant from the monitoring side.  Even reporting the number of 
simultaneously available slots in SGE isn't particularly useful.  To 
understand the meaning of the slot count, you have to understand the 
configuration of the scheduler, and that is clearly out of bounds for DRMAA.

For the submission of a parallel job, the concept of slots is not really 
required.  Parallel jobs just need to specify how many slaves there are 
and where they should run (i.e. how many per machine).  How that relates 
to slots in the DRMS is unimportant.
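
To illustrate, here is a minimal sketch of what such a submission could 
carry (class and attribute names are hypothetical, not from any DRMAA 
draft):

  // Hypothetical sketch only -- names are illustrative, not DRMAA.
  public class ParallelJobRequest {
      private final int slaveCount;       // total number of slave processes
      private final int slavesPerMachine; // how many of them run per machine

      public ParallelJobRequest(int slaveCount, int slavesPerMachine) {
          this.slaveCount = slaveCount;
          this.slavesPerMachine = slavesPerMachine;
      }

      public int getSlaveCount()       { return slaveCount; }
      public int getSlavesPerMachine() { return slavesPerMachine; }
  }

Whether the DRMS maps that onto slots, queues, or anything else stays 
hidden behind the implementation.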

I get that it would be nice to be able to expose slots in a useful way 
via DRMAA, but I'm doubtful that it's possible.  Just like queues, the 
meaning (or rather, the application) of slots is too DRM-specific.

Daniel

On 03/26/10 08:21, Mariusz Mamoński wrote:
> 2010/3/26 Daniel Templeton<daniel.templeton at oracle.com>:
>> I think slots is a concept that is out of scope for DRMAA.  There's
>> absolutely zero value in reporting slot counts in SGE unless you're also
>> going to report the queue configurations and resource policies, because the
>> total number of slots is almost never available for simultaneous use.
> I meant the number of slots for "simultaneous use", i.e. the
> system/machine capacity counted as the maximum number of single-process
> jobs allowed to run concurrently. Sorry, but I'm slightly confused
> by the "slots is a concept that is out of scope for DRMAA" (I read
> DRMAA here as a DRMS API): what do you give upon submission as the
> argument of the parallel environment: cores/CPUs?
>>
>> Daniel
>>
>> On 03/26/10 07:42, Mariusz Mamoński wrote:
>>>
>>> On 26 March 2010 15:36, Daniel Templeton<daniel.templeton at oracle.com>
>>>   wrote:
>>>>
>>>> The concept of slots in SGE is only loosely bound to CPU architecture.
>>>> We assume a slot per thread or core, but it's only a suggestion.
>>>> Administrators can configure an arbitrary number of slots.  For example,
>>>> the 1-node test cluster I have running on my workstation currently has
>>>> over 200 slots on a dual-core machine.
>>>
>>> Is it common to see production systems that permit oversubscription
>>> of CPUs? We could always add slots as a machineInfo attribute in
>>> addition to (or instead of) cpus/cores.
>>>>
>>>> Daniel
>>>>
>>>> On 03/25/10 17:09, Andre Merzky wrote:
>>>>>
>>>>> Quoting [Peter Tröger] (Mar 26 2010):
>>>>>>
>>>>>> Condor usually reports the number of cores, incl. hyperthreaded ones,
>>>>>> which conforms to the 'concurrent threads' metric Daniel proposed. To
>>>>>> my (negative) surprise, they report nothing else:
>>>>>>
>>>>>> http://www.cs.wisc.edu/condor/manual/v7.5/3_3Configuration.html#16294
>>>>>>
>>>>>> If we only look at this case, the corresponding attribute could be
>>>>>> named 'supportedSlots', since we established the understanding of
>>>>>> slots as resources for concurrent job activities / threads / processes.
>>>>>> The sockets attribute would not be implementable in Condor. The value
>>>>>> of the cores attribute could only be guessed (supportedSlots/2).
>>>>>
>>>>> Please don't hardcode that number '2': it is only valid for Intel's
>>>>> Hyper-Threading, and only at this point in time... ;-)
>>>>>
>>>>> Anyway: if one has to choose, the hardware threads are likely more
>>>>> useful than cores, IMHO, although knowing both, or even the full
>>>>> hierarchy (nodes/sockets/cores/threads), would simply be nice...
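>>>>>
>>>>> As a rough illustration only (the class and field names here are made
>>>>> up, not from any DRMAA draft), that hierarchy could be captured like
>>>>> this:
>>>>>
>>>>>   // Illustrative sketch of the nodes/sockets/cores/threads hierarchy.
>>>>>   public class HardwareTopology {
>>>>>       public int nodes;           // machines
>>>>>       public int socketsPerNode;  // physical CPU packages per machine
>>>>>       public int coresPerSocket;  // physical cores per package
>>>>>       public int threadsPerCore;  // hardware threads per core (SMT)
>>>>>
>>>>>       // Total number of concurrently runnable hardware threads.
>>>>>       public int supportedSlots() {
>>>>>           return nodes * socketsPerNode * coresPerSocket * threadsPerCore;
>>>>>       }
>>>>>   }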
>>>>>
>>>>> Best, Andre.
>>>>>
>>>>>
>>>>>>
>>>>>> But Condor is not our primary use case ;-)
>>>>>>
>>>>>> /Peter.
>>>>>>
>>>>>>
>>>>>> On 25.03.2010 at 16:50, Daniel Gruber wrote:
>>>>>>
>>>>>>> I would also vote for the total number of cores and sockets :)
>>>>>>>
>>>>>>> We could also think about reporting the number of concurrent
>>>>>>> threads that are supported by the hardware (hyperthreading in the
>>>>>>> case of Intel or chip multithreading in the case of Sun T2 processors).
>>>>>>> This could save the user from puzzling out what is meant by
>>>>>>> a core (is it a real one, or the hyperthreading/CMT thing).
>>>>>>>
>>>>>>> If not, we should at least define that a core is really a physical
>>>>>>> core.
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>>>>>
>>>>>>> On 03/25/10 15:44, Daniel Templeton wrote:
>>>>>>>>
>>>>>>>> I would tend to agree that total core count is more useful.  SGE
>>>>>>>> also reports socket count as of 6.2u5, by the way.  (That's
>>>>>>>> actually thanks to our own Daniel Gruber.)
>>>>>>>>
>>>>>>>> Daniel
>>>>>>>>
>>>>>>>> On 03/25/10 07:03, Mariusz Mamoński wrote:
>>>>>>>>
>>>>>>>>> Also fine for me. As we are talking about the monitoring interface,
>>>>>>>>> I propose two more changes to the machine monitoring interface:
>>>>>>>>>
>>>>>>>>> 1. Have a new data struct called "MachineInfo" with attributes like
>>>>>>>>> Load, PhysMemory, ... and a getMachineInfo(in String machineName)
>>>>>>>>> method in the Monitoring interface (see the sketch below).
>>>>>>>>> Rationale: the same as for JobInfo (consistency; fetching all
>>>>>>>>> machine attributes at once is more natural in DRMS APIs than
>>>>>>>>> querying each attribute separately).
>>>>>>>>>
>>>>>>>>> 2. Change machineCoresPerSocket to machineCores; if one has
>>>>>>>>> machineSockets, he or she can easily determine
>>>>>>>>> machineCoresPerSocket. The problem with the current API is that if
>>>>>>>>> the DRM does not support "machineSockets" (as far as I checked,
>>>>>>>>> only LSF provides this two-level granularity, @see Google Doc), we
>>>>>>>>> lose the most essential information: "how many single processing
>>>>>>>>> units do we have on a single machine?"
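>>>>>>>>>
>>>>>>>>> A rough sketch of the idea, in Java-binding style (the attribute
>>>>>>>>> set below is only illustrative, not a finished list):
>>>>>>>>>
>>>>>>>>>   // Illustrative only -- these names are not taken from any
>>>>>>>>>   // DRMAA draft; MachineInfo and MonitoringSession are a sketch.
>>>>>>>>>   class MachineInfo {
>>>>>>>>>       public String name;        // machine name
>>>>>>>>>       public double load;        // load average
>>>>>>>>>       public long physMemory;    // physical memory
>>>>>>>>>       public int machineCores;   // total single processing units
>>>>>>>>>       public int machineSockets; // optional, if the DRMS reports it
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>>   interface MonitoringSession {
>>>>>>>>>       // Fetch all attributes of one machine in a single call.
>>>>>>>>>       MachineInfo getMachineInfo(String machineName);
>>>>>>>>>   }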
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> On 23 March 2010 23:00, Daniel Templeton<daniel.templeton at oracle.com> wrote:
>>>>>>>>>
>>>>>>>>>> That's fine with me.
>>>>>>>>>>
>>>>>>>>>> Daniel
>>>>>>>>>>
>>>>>>>>>> On 03/23/10 13:51, Peter Tröger wrote:
>>>>>>>>>>
>>>>>>>>>>>> Any non-SGE opinion ?
>>>>>>>>>>>>
>>>>>>>>>>> Here is mine:
>>>>>>>>>>>
>>>>>>>>>>> I could only find a single source that explains where the load
>>>>>>>>>>> average in Condor comes from :)
>>>>>>>>>>>
>>>>>>>>>>> http://www.patentstorm.us/patents/5978829/description.html
>>>>>>>>>>>
>>>>>>>>>>> Condor provides only the 1-minute load average from the uptime
>>>>>>>>>>> command.
>>>>>>>>>>>
>>>>>>>>>>> Same holds for Moab:
>>>>>>>>>>>
>>>>>>>>>>> http://www.clusterresources.com/products/mwm/docs/commands/checknode.shtml
>>>>>>>>>>>
>>>>>>>>>>> And PBS:
>>>>>>>>>>>
>>>>>>>>>>> http://wiki.egee-see.org/index.php/Installing_and_configuring_guide_for_MonALISA
>>>>>>>>>>>
>>>>>>>>>>> And MAUI:
>>>>>>>>>>> https://psiren.cs.nott.ac.uk/projects/procksi/wiki/JobManagement
>>>>>>>>>>>
>>>>>>>>>>> I vote for reporting only the 1-minute load average.
>>>>>>>>>>>
>>>>>>>>>>> /Peter.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> And BTW, by using the uptime(1) load semantics, we lose Windows
>>>>>>>>>>>> support. There is no such attribute there; load is measured as a
>>>>>>>>>>>> percentage of non-idle time and has no direct relationship to the
>>>>>>>>>>>> ready queue lengths.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Peter.
>>>>>>>>>>>>
>>>>>>>>>>>> On 22.03.2010 at 16:02, Daniel Templeton wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> SGE tends to look at the 5-minute average, although any of them
>>>>>>>>>>>>> can be configured.  You could solve it the same way we did for
>>>>>>>>>>>>> SGE -- offer all three: machineLoadShort, machineLoadMed, and
>>>>>>>>>>>>> machineLoadLong.
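>>>>>>>>>>>>>
>>>>>>>>>>>>> In interface form, the idea would look roughly like this (the
>>>>>>>>>>>>> interval mapping below is only an example):
>>>>>>>>>>>>>
>>>>>>>>>>>>>   // Illustrative sketch: expose all three averages and let the
>>>>>>>>>>>>>   // application pick the interval it cares about.
>>>>>>>>>>>>>   public interface MachineLoad {
>>>>>>>>>>>>>       double machineLoadShort(); // e.g. the 1-minute average
>>>>>>>>>>>>>       double machineLoadMed();   // e.g. the 5-minute average
>>>>>>>>>>>>>       double machineLoadLong();  // e.g. the 15-minute average
>>>>>>>>>>>>>   }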
>>>>>>>>>>>>>
>>>>>>>>>>>>> Daniel
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 03/22/10 06:05, Peter Tröger wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> next remaining thing from OGF28:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We support retrieving the machineLoad average in the
>>>>>>>>>>>>>> MonitoringSession interface. At OGF, we could not agree on
>>>>>>>>>>>>>> which of the typical intervals (1/5/15 minutes) we want to use
>>>>>>>>>>>>>> here. Maybe all of them?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Peter.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>