[DRMAA-WG] Load average interval ?

Mariusz Mamoński mamonski at man.poznan.pl
Fri Mar 26 10:21:53 CDT 2010


2010/3/26 Daniel Templeton <daniel.templeton at oracle.com>:
> I think slots is a concept that is out of scope for DRMAA.  There's
> absolutely zero value in reporting slot counts in SGE unless you're also
> going to report the queue configurations and resource policies, because the
> total number of slots is almost never available for simultaneous use.
I meant the number of slots for "simultaneous use", i.e. the
system/machine capacity counted as the maximum number of single-process
jobs allowed to run concurrently. Sorry, but I'm slightly confused
by "slots is a concept that is out of scope for DRMAA" (I read
DRMAA here as a DRMS API): what are you giving upon submission as an
argument of the parallel environment: cores/CPUs?
>
> Daniel
>
> On 03/26/10 07:42, Mariusz Mamoński wrote:
>>
>> On 26 March 2010 15:36, Daniel Templeton<daniel.templeton at oracle.com>
>>  wrote:
>>>
>>> The concept of slots in SGE is only loosely bound to CPU architecture.
>>> We assume a slot per thread or core, but it's only a suggestion.
>>> Administrators can configure an arbitrary number of slots.  For example,
>>> the 1-node test cluster I have running on my workstation currently has
>>> over 200 slots on a dual-core machine.
>>
>> Is it common to observe a production system that permits oversubscription
>> of CPUs? We can always add slots as a MachineInfo attribute in addition
>> to (or instead of) cpu/cores.
>>>
>>> Daniel
>>>
>>> On 03/25/10 17:09, Andre Merzky wrote:
>>>>
>>>> Quoting [Peter Tröger] (Mar 26 2010):
>>>>>
>>>>> Condor usually reports the number of cores incl. hyperthreaded ones,
>>>>> which conforms to the 'concurrent threads' metric Daniel proposed. To
>>>>> my (negative) surprise, they report nothing else:
>>>>>
>>>>> http://www.cs.wisc.edu/condor/manual/v7.5/3_3Configuration.html#16294
>>>>>
>>>>> If we look only at this case, the corresponding attribute could be
>>>>> named 'supportedSlots', since we created the understanding of slots as
>>>>> resources for concurrent job activities / threads / processes. The
>>>>> sockets attribute would not be implementable in Condor. The value of
>>>>> the cores attribute could be guessable (supportedSlots/2).
>>>>
>>>> Please don't hardcode that number '2': it is only valid for Intel's
>>>> Hyper-Threading, and only at this point in time... ;-)
>>>>
>>>> Anyway: if one has to choose, the hardware threads are likely more
>>>> useful than cores, IMHO, although learning both, or even the full
>>>> hierarchy (nodes/sockets/cores/threads) would be simply nice...
>>>>
>>>> Best, Andre.
>>>>
>>>>
>>>>>
>>>>> But Condor is not our primary use case ;-)
>>>>>
>>>>> /Peter.
>>>>>
>>>>>
>>>>> Am 25.03.2010 um 16:50 schrieb Daniel Gruber:
>>>>>
>>>>>> I would also vote for the total amount of cores and sockets :)
>>>>>>
>>>>>> We could also think about reporting the amount of concurrent
>>>>>> threads that are supported by the hardware (hyperthreading in
>>>>>> case of Intel or chip-multithreading in case of Sun T2 processors).
>>>>>> This could prevent the user from puzzling out what is meant by
>>>>>> a core (is it a real one, or the hyperthreading/CMT thing).
>>>>>>
>>>>>> If not we should at least define that a core is really a physical
>>>>>> core.
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>>
>>>>>> On 03/25/10 15:44, Daniel Templeton wrote:
>>>>>>>
>>>>>>> I would tend to agree that total core count is more useful.  SGE also
>>>>>>> reports socket count as of 6.2u5, by the way.  (That's actually
>>>>>>> thanks
>>>>>>> to our own Daniel Gruber.)
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>>>>> On 03/25/10 07:03, Mariusz Mamoński wrote:
>>>>>>>
>>>>>>>> Also for me. As we are talking about the monitoring interface, I
>>>>>>>> propose two more changes to it:
>>>>>>>>
>>>>>>>> 1. Having a new data struct called "MachineInfo" with attributes
>>>>>>>> like Load, PhysMemory, ... and a getMachineInfo(in String
>>>>>>>> machineName) method in the Monitoring interface. Rationale: the same
>>>>>>>> as for JobInfo (consistency; fetching all machine attributes at once
>>>>>>>> is more natural in DRMS APIs than querying for each attribute
>>>>>>>> separately).
>>>>>>>>
>>>>>>>> 2. Change machineCoresPerSocket to machineCores; if one has
>>>>>>>> machineSockets, he or she can easily determine
>>>>>>>> machineCoresPerSocket. The problem with the current API is that if
>>>>>>>> the DRM does not support "machineSockets" (as far as I checked, only
>>>>>>>> LSF provides this two-level granularity, @see Google Doc), we lose
>>>>>>>> the most essential information: "how many single processing units do
>>>>>>>> we have on a single machine?"
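[Editor's note: for illustration, proposal 1 above could be sketched as
follows. The attribute set, the name machineCores, and the canned data
are assumptions drawn from this discussion, not part of any DRMAA
specification.]

```python
from dataclasses import dataclass

# Hypothetical MachineInfo struct; field names follow the thread above
# and are NOT taken from any published DRMAA document.
@dataclass
class MachineInfo:
    name: str
    load: float        # machine load average
    phys_memory: int   # physical memory in bytes
    sockets: int       # machineSockets, if the DRM reports it
    cores: int         # machineCores: total single processing units

class MonitoringSession:
    """Toy monitoring interface that returns all attributes at once,
    mirroring the JobInfo consistency argument from the proposal."""

    def __init__(self):
        # Canned data standing in for a real DRMS query.
        self._machines = {
            "node01": MachineInfo("node01", 0.42, 8 << 30, 2, 8),
        }

    def get_machine_info(self, machine_name: str) -> MachineInfo:
        # One call fetches every attribute together, so the values
        # form a consistent snapshot of the machine.
        return self._machines[machine_name]

info = MonitoringSession().get_machine_info("node01")
print(info.cores)  # 8
```

Note how coresPerSocket stays derivable (cores // sockets) when the DRM
reports sockets, while cores alone survives when it does not.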
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> On 23 March 2010 23:00, Daniel
>>>>>>>> Templeton<daniel.templeton at oracle.com>     wrote:
>>>>>>>>
>>>>>>>>> That's fine with me.
>>>>>>>>>
>>>>>>>>> Daniel
>>>>>>>>>
>>>>>>>>> On 03/23/10 13:51, Peter Tröger wrote:
>>>>>>>>>
>>>>>>>>>>> Any non-SGE opinion ?
>>>>>>>>>>>
>>>>>>>>>> Here is mine:
>>>>>>>>>>
>>>>>>>>>> I could find only a single source that explains the load average
>>>>>>>>>> source in Condor :)
>>>>>>>>>>
>>>>>>>>>> http://www.patentstorm.us/patents/5978829/description.html
>>>>>>>>>>
>>>>>>>>>> Condor provides only the 1-minute load average from the uptime
>>>>>>>>>> command.
>>>>>>>>>>
>>>>>>>>>> Same holds for Moab:
>>>>>>>>>>
>>>>>>>>>> http://www.clusterresources.com/products/mwm/docs/commands/checknode.shtml
>>>>>>>>>>
>>>>>>>>>> And PBS:
>>>>>>>>>>
>>>>>>>>>> http://wiki.egee-see.org/index.php/Installing_and_configuring_guide_for_MonALISA
>>>>>>>>>>
>>>>>>>>>> And MAUI:
>>>>>>>>>> https://psiren.cs.nott.ac.uk/projects/procksi/wiki/JobManagement
>>>>>>>>>>
>>>>>>>>>> I vote for reporting only the 1-minute load average.
>>>>>>>>>>
>>>>>>>>>> /Peter.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> And BTW, by using the uptime(1) load semantics, we lose Windows
>>>>>>>>>>> support. There is no such attribute there; load is measured as a
>>>>>>>>>>> percentage of non-idle time and has no direct relationship to the
>>>>>>>>>>> ready queue lengths.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Peter.
>>>>>>>>>>>
>>>>>>>>>>> Am 22.03.2010 um 16:02 schrieb Daniel Templeton:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> SGE tends to look at the 5-minute average, although any can be
>>>>>>>>>>>> configured.  You could solve it the same way we did for SGE --
>>>>>>>>>>>> offer
>>>>>>>>>>>> three: machineLoadShort, machineLoadMed, machineLoadLong.
>>>>>>>>>>>>
>>>>>>>>>>>> Daniel
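[Editor's note: a minimal sketch of Daniel's three-attribute approach,
assuming a POSIX host. The dictionary keys are the names suggested
above; os.getloadavg() itself is Unix-only, which echoes the
Windows-support concern raised elsewhere in the thread.]

```python
import os

def machine_load():
    # uptime(1)-style 1-, 5-, and 15-minute load averages, mapped onto
    # the attribute names proposed in the message above (assumed names,
    # not from any DRMAA specification).
    short, med, long_ = os.getloadavg()
    return {
        "machineLoadShort": short,   # 1-minute average
        "machineLoadMed": med,       # 5-minute average
        "machineLoadLong": long_,    # 15-minute average
    }

loads = machine_load()
print(sorted(loads))  # ['machineLoadLong', 'machineLoadMed', 'machineLoadShort']
```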
>>>>>>>>>>>>
>>>>>>>>>>>> On 03/22/10 06:05, Peter Tröger wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> next remaining thing from OGF28:
>>>>>>>>>>>>>
>>>>>>>>>>>>> We support the determination of machineLoad average in the
>>>>>>>>>>>>> MonitoringSession interface. At OGF, we could not agree on
>>>>>>>>>>>>> which of
>>>>>>>>>>>>> the typical intervals (1/5/15 minutes) we want to use here.
>>>>>>>>>>>>> Maybe
>>>>>>>>>>>>> all of them ?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Peter.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>>     drmaa-wg mailing list
>>>>>>>>>>>>>     drmaa-wg at ogf.org
>>>>>>>>>>>>>     http://www.ogf.org/mailman/listinfo/drmaa-wg



-- 
Mariusz

