[jsdl-wg] Questions and potential changes to JSDL, as seen from HPC Profile point-of-view

Fri Jun 9 04:45:16 CDT 2006

Marvin Theimer wrote:
> Coming from the point-of-view of the HPC Profile working group, I have 
> several questions about JSDL, as well as some straw man thoughts about 
> how JSDL should/could relate to the HPC Profile specification that I’m 
> involved with.  Some of my questions lead me to restrictions on JSDL 
> that an HPC profile specification might make.  Other questions lead to 
> potential changes that might be made as part of creating future versions 
> of JSDL.  (I’m well aware that JSDL 1.0 was meant as a starting point 
> rather than the final word on job submission descriptions and so please 
> interpret my questions as being an attempt at constructive suggestions 
> rather than a criticism of a very fine first step by the JSDL working 
> group.)

I'm going to work through these things as I read through them, so the
answers (well, my answers) might be a little disjointed. :-)

> At a high level, there are several general questions that came up when 
> reading the JSDL 1.0 specification:
> 
> ·        Can JSDL documents describe jobs other than Linux/Unix/Posix 
> jobs?  For example, things like mount points and mount sources do not 
> map in a completely straight-forward manner to how file systems are 
> provided in the Windows world.

Most certainly. The intent is that ultimately JSDL jobs should be able
to describe pretty much any request for an atomic activity, and the
POSIXApplication stuff was just a seed so that at least one common case
would be handled by the initial specification. Work is ongoing on an
extension to that to support parallel jobs (mainly MPI, but some other
architectures too), and we've had other kinds of jobs in mind for a
while (including SQL jobs, Web-service invocation jobs, and JVM jobs,
but obviously not limited to those).

On the matter of mount points, the interpretation of a mount source is
not that the mount source should be mounted at the mount point, but
rather that the job should fail if the mount is not present. Now, a JSDL
consumer might react to that failure by trying to perform the mount, but
it is not required. (The meaning of the name of the mount source is not
defined IIRC, though it probably ought to be URI-like, meaning that SMB
mounts would work fine under Windows with suitable munging.)
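
To make that reading concrete, here is a minimal sketch in Python. It is
only an illustration: check_mount, MountNotPresent and the try_mount hook
are names invented for this example, not anything defined by JSDL.

```python
import os
from typing import Callable, Optional


class MountNotPresent(Exception):
    """Raised when a mount the job depends on is missing."""


def check_mount(mount_point: str,
                try_mount: Optional[Callable[[str], None]] = None,
                is_mounted: Callable[[str], bool] = os.path.ismount) -> None:
    """Fail unless something is mounted at mount_point.

    try_mount is an optional, consumer-specific hook (hypothetical) that
    may attempt the mount; a consumer is free not to provide one.
    """
    if is_mounted(mount_point):
        return
    if try_mount is not None:
        try_mount(mount_point)  # optional recovery attempt
        if is_mounted(mount_point):
            return
    # Default behaviour: the job fails because the mount is absent.
    raise MountNotPresent(mount_point)
```

A consumer that never tries to mount anything simply calls
check_mount(path) and lets the exception fail the job.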

We'd hope that most jobs would not actually specify the mount point, but
would instead use the facilities provided by the JSDL abstract file
system processing semantics to adapt to whatever was available.

> ·        Is JSDL expressive enough to describe all the needs of a job?  
> For example, it is unclear how one would specify a requirement for 
> something like a particular instruction set variation of the IA86 
> architecture (e.g. the SSE3 version of the Pentium) or how one would 
> specify that AMD processors are required rather than Intel ones (because 
> the optimized libraries and the optimizations generated by the compiler 
> used will differ for each).  For another example, it is unclear how one 
> would specify that all the compute nodes used for something like an MPI 
> job should have the same hardware.

I think with processor types we just grabbed a snapshot of the CIM model
and went with that; updating to use a later version of that would not
cause great difficulty (though the reverse problem might then exist, in
that it might become more difficult to say that any kind of x86 arch is
OK for a particular job).

However, I believe we would assume the following interpretation of
processor requirements: if specified, that's what they want for all
processors associated with the job. If they didn't specify, they didn't
care and anything is therefore good enough.
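
As a rough illustration of that interpretation (a sketch only; the
function and parameter names are made up for the example and are not
JSDL element names), the check looks like this:

```python
from typing import Optional, Sequence


def processors_acceptable(required_arch: Optional[str],
                          allocated_archs: Sequence[str]) -> bool:
    """Return True if the allocated processors satisfy the request.

    required_arch is None when the job did not state a requirement.
    """
    if required_arch is None:
        # Nothing specified: the user did not care, anything is good enough.
        return True
    # Specified: the requirement applies to every processor in the job.
    return all(arch == required_arch for arch in allocated_archs)
```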

> ·        How will JSDL’s normative set of enumeration values for things 
> like processor architecture and operating system be kept up-to-date and 
> relevant?  Also, how should things like operating system version get 
> specified in a normative manner that will enable interoperability among 
> multiple clients and job scheduling services?  For example, things like 
> Linux and Windows versions are constantly being introduced, each with 
> potentially significant differences in capabilities that a job might 
> depend on.  Without a normative way of specifying these constantly 
> evolving version sets it will be difficult, if not impossible, to create 
> interoperable job submission clients and job scheduling services 
> (including meta-scheduling services where multiple schedulers must 
> interoperate with each other).

I don't know. :-) Maybe we should say that additional things as defined
in some other model (e.g. CIM) SHOULD be accepted? (As I said above, we
just took a snapshot of that model; updating isn't really a big deal.)

> ·        Although JSDL specifies a means of including additional 
> non-normative elements and attributes in a document, non-normative 
> extensions make interoperability difficult.  This implies the need for 
> normative extensions to JSDL beyond the Posix extension currently 
> described in the 1.0 specification.  Are there plans to define 
> additional extension profiles to address the above questions surrounding 
> expressive power and normative descriptions of things like current OS 
> types and versions?

We do not currently have *specific* plans to do this, but that does not
mean we cannot have such specific plans in fairly short order. :-)

> ·        If one accepts the need for a variety of extension profiles 
> then this raises the question of what should be in the base case.  For 
> example, it could be argued that data staging – with its attendant 
> aspects such as mount points and mount sources – should be defined in an 
> extension rather than in the core specification that will need to cover 
> a variety of systems beyond just Linux/Unix/Posix.  Similarly, one might 
> argue that the base case should focus on what’s /functionally/ necessary 
> to execute a job correctly and should leave things that are 
> “optimization hints”, such as CPU speed and network bandwidth 
> specifications, to extension profiles.

Sounds fairly reasonable, though the abstract filesystem stuff has real
uses in that it makes it much easier to write a job request that deals
with things like varying locations of home directories and scratch
space. The alternative is to assume that temporary files are always
written to somewhere like /tmp, immediately stuffing interop even
between Unix-based HPC centres (we don't write large files to /tmp here
because that's not a cluster-wide resource and is therefore not very
useful) let alone with any Windows-based service.
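
A small sketch of what that adaptation amounts to, assuming each site
keeps its own table from the well-known file system names to local
paths. The site tables and the resolve function are hypothetical; only
the names HOME, TMP and SCRATCH come from the discussion.

```python
import posixpath
from typing import Dict

# Hypothetical per-site configuration for the well-known file systems.
SITE_A: Dict[str, str] = {
    "HOME": "/home/alice",
    "TMP": "/tmp",
    "SCRATCH": "/gpfs/scratch/alice",
}
SITE_B: Dict[str, str] = {
    "HOME": "/users/a/alice",
    "TMP": "/var/tmp",
    "SCRATCH": "/work/alice",
}


def resolve(site: Dict[str, str], filesystem_name: str,
            relative_path: str) -> str:
    """Turn (file system name, relative path) into a site-local path."""
    if filesystem_name not in site:
        raise LookupError(f"site does not provide {filesystem_name}")
    if relative_path.startswith("/"):
        raise ValueError("path must be relative to the named file system")
    return posixpath.join(site[filesystem_name], relative_path)
```

The same job request, naming SCRATCH and run1/out.dat, then lands in
/gpfs/scratch/alice/run1/out.dat at one site and /work/alice/run1/out.dat
at the other without the request itself changing.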

But it is entirely reasonable to support mount points and sources by
saying things like "if it doesn't match my current configuration, I'll
fault". That is most certainly a legal interpretation of how to process
a JSDL document. This is probably an issue that ought to be covered in
the primer, when we finally write it. :-)

> ·        How are concepts such as IndividualCPUSpeed and 
> IndividualNetworkBandwidth intended to be defined and used in practice?  
> I understand the concept of specifying things like the amount of 
> physical memory or disk space that a job will require in order to be 
> able to run.  However, CPU speed and network bandwidth don’t represent 
> functional requirements for a job – meaning that a job will correctly 
> run and produce the same results irrespective of the CPU speed and 
> network bandwidth available to it.  Also, the current definitions seem 
> fuzzy: the megahertz number for a CPU does not tell you how fast a given 
> compute node will be able to execute various kinds of jobs, given all 
> the various hardware factors that can affect the performance of a 
> processor (consider the presence/absence of floating point support, the 
> memory caching architecture, etc.).  Similarly, is network bandwidth 
> meant to represent the theoretical maximum of a compute node’s network 
> interface card?  Is it expected to take into account the performance of 
> the switch that the compute node is attached to?  Since switch 
> performance is partially a function of the pattern of (aggregate) 
> traffic going through it, the network bandwidth that a job such as an 
> MPI application can expect to receive will depend on the /type/ of 
> communications patterns employed by the application.  How should this 
> aspect of network bandwidth be reflected – if at all – in the network 
> bandwidth values that a job requests and that compute nodes advertise?

CPU speed is a fairly meaningless value really, since it is at best a
poor approximation to application performance, which is what people are
really interested in. Application performance itself is not portable in
any sensible way, though, as you can't extrapolate from the performance
of one application to that of another. So CPU speed is probably the best
we've got (we could do FLOPS or MIPS instead I suppose, but I suspect
neither is much better).

Network bandwidth is worse, because it is only meaningful when defined
with respect to a defined pair of endpoints (or, more particularly here,
w.r.t. a defined remote endpoint, since the other one is defined by
where the job is submitted to). What's worse is that latency isn't
defined at all, and that's at least as important for complex apps. In
short, I think we didn't get the network bandwidth right. :-\

However, the general policy of accepting quality-of-service requirements
on resources is one I agree with, since they really do matter and they
are constraints on whether a particular resource is fit for the user's
purpose.

> ·        JSDL is intended for describing the requirements of a job being 
> submitted for execution.  To enable matchmaking between submitted jobs 
> and available computational resources there must also be a way of 
> describing existing/available resources.  While much of JSDL can be used 
> for this purpose, it is also clear that various extensions are 
> necessary.  For example, to describe a compute cluster requires that one 
> be able to specify the resources for each compute node in the cluster 
> (which may be a heterogeneous lot).  Similarly, to describe a compute 
> node with multiple network interfaces would require an extension to the 
> current model, which assumes that only a single instance of such things 
> can exist.  This raises the question of whether something other than 
> JSDL is intended to be used for describing available computational 
> resources or whether there are intentions to extend JSDL to enable it to 
> describe such resources. 

Strictly this is outside the scope of JSDL, where we've stuck firmly to
the niche of describing user requests and not the things with which
those requests may be satisfied. However, I do have some ideas on this. :-)

JSDL terms can indeed be used for resource description, and this is
because you can interpret them as saying something like "this is the
maximal set of processors I will allocate to any job you submit".

The UniGrids project has looked at several ways to do such resource
descriptions based on JSDL. The simplest model we've found was to say
that each target system service (BES-analog) supports a single unified
homogeneous resource description, and that where we have a heterogeneous
cluster we describe that as multiple services, each with a smaller claim
on the range of resources allocated to it. This allows for a simple
resource model and simple matching rules, and it still covers the 90%
case neatly.

Let me flesh that out with an example. Suppose we have a cluster of
machines, four from Intel (with 2GB memory each) and four from AMD (two
with 1GB, two with 4GB). This induces 5 services, with resource claims
as follows:

  * 2 AMD processors, 4GB
  * 4 AMD processors, 1GB
  * 4 Intel processors, 2GB
  * 6 x86 processors, 2GB
  * 8 x86 processors, 1GB

It should be noted that these separate services would actually be pretty
cheap in our implementation, since we can host them in the same
container at a cost of a few extra objects. :-)
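
For the curious, the five claims in the example can be derived
mechanically. The following is a sketch under stated assumptions, not
the UniGrids implementation: it assumes one processor per machine,
treats AMD and Intel as kinds of x86, and uses names (Node,
service_claims, matches) invented for this illustration.

```python
from typing import List, NamedTuple, Tuple


class Node(NamedTuple):
    vendor: str       # "AMD" or "Intel"; both are treated as kinds of x86
    memory_gb: int


Claim = Tuple[str, int, int]  # (processor class, processor count, memory GB)


def service_claims(nodes: List[Node]) -> List[Claim]:
    """Enumerate the distinct homogeneous claims a cluster can make."""
    claims = set()
    classes = {n.vendor for n in nodes} | {"x86"}
    thresholds = {n.memory_gb for n in nodes}
    for cls in classes:
        for threshold in thresholds:
            members = [n for n in nodes
                       if cls in ("x86", n.vendor)
                       and n.memory_gb >= threshold]
            if not members:
                continue
            vendors = {n.vendor for n in members}
            # Use the most specific label that describes every member.
            label = vendors.pop() if len(vendors) == 1 else "x86"
            memory = min(n.memory_gb for n in members)
            claims.add((label, len(members), memory))
    return sorted(claims)


def matches(claim: Claim, arch: str, processors: int, memory_gb: int) -> bool:
    """True if a request fits inside what the service claims to offer."""
    label, count, memory = claim
    arch_ok = arch == "x86" or arch == label
    return arch_ok and processors <= count and memory_gb <= memory


cluster = [Node("Intel", 2)] * 4 + [Node("AMD", 1)] * 2 + [Node("AMD", 4)] * 2
for label, count, memory in service_claims(cluster):
    print(f"{count} {label} processors, {memory}GB")
```

Run on the cluster above, this prints the same five claims as the list.
The matching rule stays trivial: a request for six x86 processors with
2GB each fits only the "6 x86 processors, 2GB" service.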

Maybe other approaches would be better, but the matter of resource
description is politically tricky for this WG since it gets into space
claimed by others.

> ·        The current specification stipulates that conformant 
> implementations must be able to parse all the elements and attributes 
> defined in the spec, but doesn’t require that any of them be supplied.  
> Thus, a scheduling service that does nothing could claim to be compliant 
> as long as it can correctly parse JSDL documents.  For interoperability 
> purposes, I would argue that the spec should define a minimum set of 
> elements that any compliant service must be able to supply. Otherwise 
> clients will not be able to make any assumptions about what they can 
> specify in a JSDL document and, in particular, client applications that 
> programmatically submit job submission requests will not be possible 
> since they can’t assume that any valid JSDL document will actually be 
> acceptable by any given job submission service.

I'd argue that this profiling of JSDL should be done by BES or
yourselves (the HPC profile). This is because there are other cases
(e.g. as synchronization points in workflow processing) where null jobs
are actually useful.

> ·        I have a number of questions about data staging:

I have one major observation: the data staging stuff is known to be a
long way off perfect.

> ·        Although the notions of working directory and environment 
> variables are defined in the posix extension, they are implicitly 
> assumed in the data staging section of the core specification.  This 
> implies to me that either (a) data staging is made an extension or (b) 
> these concepts are made a normative, required part of the core 
> specification.

Good point. I suppose our response to this should be contingent on
whether "context location" (i.e. working directory) can be defined for
all currently conceived-of job types. I don't know how to answer this
yet. It's certainly possible for many of the things we've identified,
but all?

> ·        Recursive directory copying can be specified, but is not 
> required to be supplied by any job submission service.  This makes it 
> difficult to write applications that programmatically define their data 
> staging needs since they cannot in the current design determine whether 
> any given job submission service implements recursive directory 
> copying.  In practice this may mean that programmatically generated job 
> submissions will only ever use lists of individual files to stage. 

It means that only _interoperable_ ones will do that, but I think there
are already implementations of directory staging out there and clients
that are generating jobs that use it. I may be wrong though. :-)

> ·        The current definitions of the well-known file systems seem 
> imprecise to me.  In particular:
> 
> ·        What are the navigation rules associated with each?  Can you cd 
> out of the subtree that each represents?  ROOT almost certainly does not 
> allow that.  Is there an assumption that one can cd out of HOME or TMP 
> or SCRATCH?  Hopefully not, since that would make these file systems 
> even more Unix/Linux-centric, plus one would now need to specify what 
> clients can expect to see when they do so.

We don't specify. Portable applications don't change directory at all in
my experience; it's too full of strange behaviour as the meaning of all
relative paths changes...

> ·        What is ROOT intended to be used for?  Are there assumptions 
> about what resides under root?  Are there assumptions about what an 
> application can read/write under the ROOT subtree?  (ROOT also seems 
> like the most Unix-specific of the 4 file system types defined.)

Fair points, and I'd usually assume that the root FS was not writable.
It probably is fairly Unix-specific. But it does make life much easier
for integrating with legacy job systems which can handle the other FS
types by translation into the root and adding a prefix to the paths.
FWIW, I wouldn't use ROOT in my jobs. :-)

> ·        What are the sharing/consistency semantics of each file system 
> in situations where a job is a multi-node application running on 
> something like a cluster?  Is HOME visible to all compute nodes in a 
> data-consistent manner?  I’m guessing that TMP would be assumed to be 
> strictly local to each compute node, so that things like MPI 
> applications would need to be cognizant that they are writing multiple 
> files to multiple separate storage systems when they write to a file in 
> TMP – and furthermore that data staging of such files after a job has 
> run will result in multiple files that all map to the same target file.

I've been assuming that (or at least configuring our local systems so 
that) TMP was node-local and SCRATCH was cluster-wide.

> ·        Can other users write over or delete your data in TMP and/or 
> SCRATCH?  Is data in these file systems visible to other users or does 
> each job get its own private TMP and SCRATCH?

I'd assume that other users can never overwrite your data and wouldn't
make any assumptions at all about the level of isolation of either TMP
or SCRATCH with respect to other jobs owned by the same user. But that
would make an excellent topic to be included in any system policy
statement. (Another policy might be that your job submission has to be
digitally signed and the signer's certificate has to be signed in turn
by a particular CA.)

It might be a good idea to codify some best practice on this in the HPC
profile.

> ·        How long does data in SCRATCH stay around?  Without some 
> normative definition – or at least a normative lower bound – on data 
> lifetime clients will have to assume that the data can vanish 
> arbitrarily and things like multi-job workflows will be very difficult 
> to write if they try to take advantage of SCRATCH space to avoid 
> unnecessary data staging actions to/from a computing facility.

Again, that's something that is a site policy (I think we've locally got 
a "one month after last use, with some fairly coarse granularity" 
policy). However, grid systems bring something to the table here in that 
by describing jobs as resources in their own right (with definite known 
lifespans) it should be possible to design systems that make better 
decisions over when a piece of temporary data has become unreferenced 
and may be deleted.
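
A minimal sketch of that idea, assuming the system records which jobs
reference each piece of temporary data and when each job's lifespan
ends. The function and its arguments are hypothetical, and a real site
would presumably add a grace period on top.

```python
from datetime import datetime
from typing import Dict, Iterable, List


def unreferenced(data_refs: Dict[str, Iterable[str]],
                 job_end: Dict[str, datetime],
                 now: datetime) -> List[str]:
    """Return the data items whose referencing jobs have all finished.

    data_refs maps a data item to the ids of the jobs that use it;
    job_end maps a job id to the end of that job's known lifespan.
    """
    return sorted(item for item, jobs in data_refs.items()
                  if all(job_end[job] <= now for job in jobs))
```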

Profiling some best practice here seems sensible.

> ·        From an interoperability and programmatic submission 
> point-of-view, it is important to know which transports any given job 
> submission service can be expected to support.  This seems like another 
> area where a normative minimal set that all job submission services must 
> implement needs to be defined.

Agreed, but this is something that we basically punted on. (Also, the
notion of what is a source or destination for a staging action turns out
to be messy sometimes. Alas.)

> Given these questions, as well as the mandate for the HPC profile to 
> define a simple base interface (that can cover the HPC use case of 
> submitting jobs to a compute cluster), I would like to present the 
> following straw man proposal for feedback from this community:
> 
> ·        Restructure the JSDL specification as a small core 
> specification that must be universally implemented – i.e. not just 
> parsable, but also suppliable by all compliant job submission services – 
> and a number of optional extension profiles.

Sounds sensible.

> ·        Declare concepts such as executable path, command-line 
> arguments, environment variables, and working directory to be generic 
> and include them in the core JSDL specification rather than the posix 
> extension.  This may enable the core specification to support things 
> like Windows-based jobs (TBD).  The goal here is to define a core JSDL 
> specification that in-and-of-itself could enable job submission to a 
> fairly wide range of execution subsystems, including both the 
> Unix/Linux/Posix world and the Windows world.

Again, it's not quite clear to me that all those concepts are meaningful
in all job types (as opposed to those that are clearly just a way to
execute some binary with a bunch of arguments).

> ·        Move data staging to an extension.

I'm not sure about this.

> ·        Create precise definitions of the various concepts introduced 
> in the data staging extension, including normative requirements about 
> whether or not one can change directory up and out of a file system’s 
> root directory, etc.

Good idea.

> ·        Define which transports are expected to be implemented by all 
> compliant services.

Very good idea.

> ·        Move the various enumeration types – e.g. for CPU architecture 
> and OS – to separate specification documents so that they can evolve 
> without requiring corresponding and constant revision of the core JSDL 
> specification.

Excellent idea. :-)

> ·        Define extension profiles (eventually, not right away) that 
> enable richer description of hardware and software requirements, such as 
> details of the CPU architecture or OS capabilities.  As part of this, 
> move optimization hints, such as CPU speed and network bandwidth 
> elements out of the JSDL core and into a separate extension profile.

Sounds pretty sensible to me.

> ·        Embrace the issue of how to specify available resources at an 
> execution subsystem.  Start by defining a base case that allows the 
> description of compute clusters by creating a compound JSDL document 
> that consists of an outer element that ties together a sequence of 
> individual JSDL elements, each of which describes a single compute node 
> of a compute cluster.  Define an explicit notion of extension profiles 
> that could define other ways of describing computational resources 
> beyond just an array of simple JSDL descriptions.

Interesting. Probably a good topic for discussion going forward.

> Now, as presented above, my straw man proposal looks like suggestions 
> for changes that might go into a JSDL-1.1 or JSDL-2.0 specification.  In 
> the near-term, the HPC profile working group will be exploring what can 
> be done with just JSDL-1.0 and restrictions to that specification.  The 
> restrictions would correspond to disallowing those parts of the JSDL-1.0 
> specification that the above proposal advocates moving to extension 
> profiles.  It will also explore whether a restricted version of the 
> posix extension could be used to cover most common Windows cases.

Sounds like a reasonable plan to me.

Donal.