[glue-wg] GLUE2.1 draft: request for final discussion and approval

Paolo Andreetto paolo.andreetto at pd.infn.it
Mon Jun 4 05:46:52 EDT 2018


On 06/01/2018 11:39 AM, Stephen Burke - UKRI STFC wrote:
> A few quick comments ...
>
> Generally it would be good to highlight all changes, e.g. new class definitions.
>
> You have new classes called ComputingShareAcceleratorInfo and ComputingManagerAcceleratorInfo - I wonder if it's possible to have a better name than Info, although perhaps you might want to be able to put new information there. Also having two slightly different objects both called Info may be a bit confusing. And you don't need Accelerator in the attribute names, e.g. TotalAcceleratorSlots, it makes them longer and they're scoped by the class name anyway. 
Even better, I think we can remove the "Info" suffix from both class
names.
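
Just to make the combined renaming concrete, here is a rough sketch
(Python purely as illustration; the attribute sets are my assumption
from this thread, not the definitive draft list):

from dataclasses import dataclass
from typing import Optional

@dataclass
class ComputingManagerAccelerator:       # was ComputingManagerAcceleratorInfo
    Type: str                            # e.g. "GPU"
    TotalPhysical: Optional[int] = None  # was TotalPhysicalAccelerators
    TotalSlots: Optional[int] = None     # was TotalAcceleratorSlots
    UsedSlots: Optional[int] = None

@dataclass
class ComputingShareAccelerator:         # was ComputingShareAcceleratorInfo
    Type: str
    FreeSlots: Optional[int] = None
    MaxSlotsPerJob: Optional[int] = None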
>
> In ManagerInfo you have TotalPhysicalAccelerators - why not Logical too? It also isn't clear to me what the relationship is between a card and a slot - for CPUs there is no particular relationship, the OS takes care of scheduling processes to CPUs, but I don't know how GPUs are used.
>
> The relationship between AcceleratorEnvironment and ExecutionEnvironment is *-*, which is complicated to implement and navigate and seems like overkill to me. At least I'd make it *-1 and if you happen to have the same AE info for multiple EEs you just publish the same thing multiple times. I also somewhat wonder if you would really have multiple Accelerator types in one EE, i.e. several different GPU types in a single WN - if not you could just add the attributes to the EE and not have a new class at all. ComputeCapability would be better as an open enumeration if you expect software to be able to use it as a selection key.

It's not easy to design a model suitable for any kind of future
accelerator device.
We can have multiple (different) cards on the same WN, and we have some
examples of that.
I have also read about new architectures with multiple GPU chips on the
same card.
Finally, there are GPU appliances (Nvidia Quadro VCA) that can be shared
among different WNs, and I'm not sure that cloning the
AcceleratorEnvironment would capture that case.
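
To sketch why we kept the *-* association (again Python purely as
illustration; the attribute names are my assumption):

from dataclasses import dataclass

@dataclass
class AcceleratorEnvironment:
    Type: str       # e.g. "GPU"
    Vendor: str
    Model: str

@dataclass
class ExecutionEnvironment:
    ID: str
    Accelerators: list  # *-* association ends to AcceleratorEnvironment

# One WN type hosting two different card models: one EE, two AEs.
k80 = AcceleratorEnvironment("GPU", "Nvidia", "Tesla K80")
p100 = AcceleratorEnvironment("GPU", "Nvidia", "Tesla P100")
mixed_wn = ExecutionEnvironment("ee-mixed", [k80, p100])

# One shared appliance serving several WN types: one AE referenced by
# two EEs, rather than cloned into each of them.
vca = AcceleratorEnvironment("GPU", "Nvidia", "Quadro VCA")
wn_a = ExecutionEnvironment("ee-a", [vca])
wn_b = ExecutionEnvironment("ee-b", [vca])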

The concept of "slot" is not so clear as the one defined for CPU, but in
any case it should represent the minimum amount of resource usage for a
job on a given device.
The card, or socket, is just a physical description of the device.
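
As a made-up example of the distinction (all numbers invented):

# The card count is the physical description; the slot count is the
# schedulable unit the batch system actually hands out to jobs.
physical_cards = 2     # what TotalPhysicalAccelerators would describe
slots_per_card = 4     # a site sharing policy, e.g. via NVIDIA MPS
total_slots = physical_cards * slots_per_card   # TotalAcceleratorSlots = 8

running_jobs = 5       # each job consumes at least one slot
free_slots = total_slots - running_jobs         # 3 slots still schedulable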

As for the new AI/ML accelerator devices, like the Google TPU, I
honestly don't feel skilled enough to propose a model.

-- 
----------------------
Ing. Paolo Andreetto
INFN Sezione di Padova
Via Marzolo, 8
35131 Padova - Italy

Tel: +39 049.967.7378
Skype: andreettopl
----------------------
