[ogsa-bes-wg] Questions and potential changes to JSDL, as seen from HPC Profile point-of-view

Marvin Theimer theimer at microsoft.com
Thu Jun 8 19:05:50 CDT 2006


Hi;

Coming from the point-of-view of the HPC Profile working group, I have
several questions about JSDL, as well as some straw man thoughts about
how JSDL should/could relate to the HPC Profile specification that I'm
involved with.  Some of my questions lead me to restrictions on JSDL
that an HPC profile specification might impose; others point to
potential changes that might be made in future versions of JSDL.  (I'm
well aware that JSDL 1.0 was meant as a starting point rather than the
final word on job submission descriptions, so please read my questions
as constructive suggestions rather than criticism of a very fine first
step by the JSDL working group.)

 

At a high level, there are several general questions that came up when
reading the JSDL 1.0 specification:

*        Can JSDL documents describe jobs other than Linux/Unix/Posix
jobs?  For example, things like mount points and mount sources do not
map in a completely straightforward manner to how file systems are
provided in the Windows world.
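
To make this concrete, a file-system declaration in JSDL 1.0 looks
roughly like the fragment below (element names are from my reading of
the 1.0 schema; the values are invented):

    <jsdl:FileSystem name="HOME">
      <jsdl:Description>User home directory, NFS-mounted on every node</jsdl:Description>
      <jsdl:MountPoint>/home/jobuser</jsdl:MountPoint>
      <jsdl:MountSource>nfs-server:/export/home</jsdl:MountSource>
    </jsdl:FileSystem>

The MountPoint/MountSource pair presumes a Unix-style mount model; the
nearest Windows analogues - a drive letter such as H: or a UNC share
such as \\fileserver\home - are not mount points in this sense.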

*        Is JSDL expressive enough to describe all the needs of a job?
For example, it is unclear how one would specify a requirement for
something like a particular instruction-set variation of the x86
architecture (e.g. the SSE3 version of the Pentium) or how one would
specify that AMD processors are required rather than Intel ones (because
the optimized libraries and the optimizations generated by the compiler
used will differ for each).  For another example, it is unclear how one
would specify that all the compute nodes used for something like an MPI
job should have the same hardware.
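
As far as I can tell, the closest the 1.0 Resources vocabulary comes is
the coarse architecture enumeration, e.g.:

    <jsdl:CPUArchitecture>
      <jsdl:CPUArchitectureName>x86</jsdl:CPUArchitectureName>
    </jsdl:CPUArchitecture>

There is no normative place to add "with SSE3" or "AMD only"; such
constraints would have to go into ad hoc, non-normative extension
elements that other implementations would not understand.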

*        How will JSDL's normative set of enumeration values for things
like processor architecture and operating system be kept up-to-date and
relevant?  Also, how should things like operating system version get
specified in a normative manner that will enable interoperability among
multiple clients and job scheduling services?  For example, things like
Linux and Windows versions are constantly being introduced, each with
potentially significant differences in capabilities that a job might
depend on.  Without a normative way of specifying these constantly
evolving version sets, it will be difficult, if not impossible, to create
interoperable job submission clients and job scheduling services
(including meta-scheduling services, where multiple schedulers must
interoperate with each other).
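
For instance, the operating system description reduces to an enumerated
name plus a free-form version string (the values below are only
illustrative):

    <jsdl:OperatingSystem>
      <jsdl:OperatingSystemType>
        <jsdl:OperatingSystemName>LINUX</jsdl:OperatingSystemName>
      </jsdl:OperatingSystemType>
      <jsdl:OperatingSystemVersion>2.6.9-22.EL</jsdl:OperatingSystemVersion>
    </jsdl:OperatingSystem>

One client might fill in the version as a kernel string, another as a
distribution release name; without a normative convention a scheduler
has no reliable way to match the two.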

*        Although JSDL specifies a means of including additional
non-normative elements and attributes in a document, non-normative
extensions make interoperability difficult.  This implies the need for
normative extensions to JSDL beyond the Posix extension currently
described in the 1.0 specification.  Are there plans to define
additional extension profiles to address the above questions surrounding
expressive power and normative descriptions of things like current OS
types and versions?

*        If one accepts the need for a variety of extension profiles
then this raises the question of what should be in the base case.  For
example, it could be argued that data staging - with its attendant
aspects such as mount points and mount sources - should be defined in an
extension rather than in the core specification that will need to cover
a variety of systems beyond just Linux/Unix/Posix.  Similarly, one might
argue that the base case should focus on what's functionally necessary
to execute a job correctly and should leave things that are
"optimization hints", such as CPU speed and network bandwidth
specifications, to extension profiles.

*        How are concepts such as IndividualCPUSpeed and
IndividualNetworkBandwidth intended to be defined and used in practice?
I understand the concept of specifying things like the amount of
physical memory or disk space that a job will require in order to be
able to run.  However, CPU speed and network bandwidth don't represent
functional requirements for a job - meaning that a job will correctly
run and produce the same results irrespective of the CPU speed and
network bandwidth available to it.  Also, the current definitions seem
fuzzy: the megahertz number for a CPU does not tell you how fast a given
compute node will be able to execute various kinds of jobs, given all
the various hardware factors that can affect the performance of a
processor (consider the presence/absence of floating point support, the
memory caching architecture, etc.).  Similarly, is network bandwidth
meant to represent the theoretical maximum of a compute node's network
interface card?  Is it expected to take into account the performance of
the switch that the compute node is attached to?  Since switch
performance is partially a function of the pattern of (aggregate)
traffic going through it, the network bandwidth that a job such as an
MPI application can expect to receive will depend on the type of
communications patterns employed by the application.  How should this
aspect of network bandwidth be reflected - if at all - in the network
bandwidth values that a job requests and that compute nodes advertise?
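
For reference, my understanding is that these are expressed as range
values, with Hertz and bits-per-second as the units (if I am reading the
spec correctly; the numbers below are invented):

    <jsdl:IndividualCPUSpeed>
      <jsdl:LowerBoundedRange>2.0E9</jsdl:LowerBoundedRange>
    </jsdl:IndividualCPUSpeed>
    <jsdl:IndividualNetworkBandwidth>
      <jsdl:LowerBoundedRange>1.0E9</jsdl:LowerBoundedRange>
    </jsdl:IndividualNetworkBandwidth>

Nothing in the definitions says whether the 1.0E9 bits/second refers to
the NIC's line rate, the achievable point-to-point throughput, or the
bisection bandwidth that a parallel job can actually count on.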

*        JSDL is intended for describing the requirements of a job being
submitted for execution.  To enable matchmaking between submitted jobs
and available computational resources there must also be a way of
describing existing/available resources.  While much of JSDL can be used
for this purpose, it is also clear that various extensions are
necessary.  For example, to describe a compute cluster requires that one
be able to specify the resources for each compute node in the cluster
(which may be a heterogeneous lot).  Similarly, to describe a compute
node with multiple network interfaces would require an extension to the
current model, which assumes that only a single instance of such things
can exist.  This raises the question of whether something other than
JSDL is intended to be used for describing available computational
resources, or whether there are intentions to extend JSDL to enable it
to describe such resources.

*        The current specification stipulates that conformant
implementations must be able to parse all the elements and attributes
defined in the spec, but doesn't require that any of them be supplied.
Thus, a scheduling service that does nothing could claim to be compliant
as long as it can correctly parse JSDL documents.  For interoperability
purposes, I would argue that the spec should define a minimum set of
elements that any compliant service must be able to supply.  Otherwise,
clients will not be able to make any assumptions about what they can
specify in a JSDL document; in particular, client applications that
programmatically generate job submission requests will not be possible,
since they can't assume that any valid JSDL document will actually be
accepted by any given job submission service.

*        I have a number of questions about data staging:

*        Although the notions of working directory and environment
variables are defined in the posix extension, they are implicitly
assumed in the data staging section of the core specification.  This
implies to me that either (a) data staging should be made an extension
or (b) these concepts should be made a normative, required part of the
core specification.

*        Recursive directory copying can be specified, but no job
submission service is required to support it.  This makes it difficult
to write applications that programmatically define their data staging
needs, since under the current design they cannot determine whether any
given job submission service implements recursive directory copying.
In practice this may mean that programmatically generated job
submissions will only ever use lists of individual files to stage (see
the illustrative staging fragment following this list of questions).

*        The current definitions of the well-known file systems seem
imprecise to me.  In particular:

*        What are the navigation rules associated with each?  Can you cd
out of the subtree that each represents?  ROOT almost certainly does not
allow that.  Is there an assumption that one can cd out of HOME or TMP
or SCRATCH?  Hopefully not, since that would make these file systems
even more Unix/Linux-centric, plus one would now need to specify what
clients can expect to see when they do so.

*        What is ROOT intended to be used for?  Are there assumptions
about what resides under ROOT?  Are there assumptions about what an
application can read/write under the ROOT subtree?  (ROOT also seems
like the most Unix-specific of the four file system types defined.)

*        What are the sharing/consistency semantics of each file system
in situations where a job is a multi-node application running on
something like a cluster?  Is HOME visible to all compute nodes in a
data-consistent manner?  I'm guessing that TMP would be assumed to be
strictly local to each compute node, so that things like MPI
applications would need to be cognizant that they are writing multiple
files to multiple separate storage systems when they write to a file in
TMP - and furthermore that data staging of such files after a job has
run will result in multiple files that all map to the same target file.

*        Can other users write over or delete your data in TMP and/or
SCRATCH?  Is data in these file systems visible to other users or does
each job get its own private TMP and SCRATCH?

*        How long does data in SCRATCH stay around?  Without some
normative definition - or at least a normative lower bound - on data
lifetime, clients will have to assume that the data can vanish
arbitrarily, and things like multi-job workflows will be very difficult
to write if they try to take advantage of SCRATCH space to avoid
unnecessary data staging to/from a computing facility.

*        From an interoperability and programmatic submission
point-of-view, it is important to know which transports any given job
submission service can be expected to support.  This seems like another
area where a normative minimal set that all job submission services must
implement needs to be defined.
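
To make the data staging questions above concrete, a single staging
directive in 1.0 looks roughly like the following (element names are
from my reading of the spec; the file name and URI are invented):

    <jsdl:DataStaging>
      <jsdl:FileName>params.dat</jsdl:FileName>
      <jsdl:FilesystemName>SCRATCH</jsdl:FilesystemName>
      <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
      <jsdl:DeleteOnTermination>true</jsdl:DeleteOnTermination>
      <jsdl:Source>
        <jsdl:URI>ftp://datastore.example.org/runs/42/params.dat</jsdl:URI>
      </jsdl:Source>
    </jsdl:DataStaging>

Every file to be staged needs its own element of this form (hence the
concern about recursive directory copying), the FilesystemName refers to
one of the underspecified well-known file systems, and the URI scheme is
the only indication of the transport - with no statement anywhere about
which schemes a compliant service must accept.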

 

Given these questions, as well as the mandate for the HPC profile to
define a simple base interface (that can cover the HPC use case of
submitting jobs to a compute cluster), I would like to present the
following straw man proposal for feedback from this community:

*        Restructure the JSDL specification as a small core
specification that must be universally implemented - i.e. not just
parsable, but also suppliable by all compliant job submission services -
and a number of optional extension profiles.

*        Declare concepts such as executable path, command-line
arguments, environment variables, and working directory to be generic
and include them in the core JSDL specification rather than the posix
extension.  This may enable the core specification to support things
like Windows-based jobs (TBD).  The goal here is to define a core JSDL
specification that in-and-of-itself could enable job submission to a
fairly wide range of execution subsystems, including both the
Unix/Linux/Posix world and the Windows world.
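
Purely as a hypothetical sketch - these element names do not exist
today; their equivalents currently live in the jsdl-posix
POSIXApplication element - a generic core Application description might
look like:

    <jsdl:Application>
      <jsdl:Executable>app.exe</jsdl:Executable>
      <jsdl:Argument>-n</jsdl:Argument>
      <jsdl:Argument>16</jsdl:Argument>
      <jsdl:Environment name="LICENSE_SERVER">license1:27000</jsdl:Environment>
      <jsdl:WorkingDirectory>run01</jsdl:WorkingDirectory>
    </jsdl:Application>

with nothing in it that presumes a Posix shell or a Unix file-system
layout.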

*        Move data staging to an extension.

*        Create precise definitions of the various concepts introduced
in the data staging extension, including normative requirements about
whether or not one can change directory up and out of a file system's
root directory, etc.

*        Define which transports are expected to be implemented by all
compliant services.

*        Move the various enumeration types - e.g. for CPU architecture
and OS - to separate specification documents so that they can evolve
without requiring corresponding and constant revision of the core JSDL
specification.

*        Define extension profiles (eventually, not right away) that
enable richer description of hardware and software requirements, such as
details of the CPU architecture or OS capabilities.  As part of this,
move optimization hints, such as the CPU speed and network bandwidth
elements, out of the JSDL core and into a separate extension profile.

*        Embrace the issue of how to specify available resources at an
execution subsystem.  Start by defining a base case that allows the
description of compute clusters by creating a compound JSDL document
that consists of an outer element that ties together a sequence of
individual JSDL elements, each of which describes a single compute node
of a compute cluster.  Define an explicit notion of extension profiles
that could define other ways of describing computational resources
beyond just an array of simple JSDL descriptions.
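
Again as a strictly hypothetical illustration - the outer wrapper
element and its namespace below are invented for the sake of the example
- such a compound document might contain one Resources description per
node, which also covers the heterogeneous-cluster case mentioned
earlier:

    <hpc:ComputeCluster xmlns:hpc="urn:example:hpc-profile"
                        xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl">
      <jsdl:Resources>
        <jsdl:CPUArchitecture>
          <jsdl:CPUArchitectureName>x86</jsdl:CPUArchitectureName>
        </jsdl:CPUArchitecture>
        <jsdl:IndividualPhysicalMemory>
          <jsdl:Exact>4294967296</jsdl:Exact>
        </jsdl:IndividualPhysicalMemory>
      </jsdl:Resources>
      <jsdl:Resources>
        <jsdl:CPUArchitecture>
          <jsdl:CPUArchitectureName>x86_64</jsdl:CPUArchitectureName>
        </jsdl:CPUArchitecture>
        <jsdl:IndividualPhysicalMemory>
          <jsdl:Exact>8589934592</jsdl:Exact>
        </jsdl:IndividualPhysicalMemory>
      </jsdl:Resources>
    </hpc:ComputeCluster>

Extension profiles could then define richer descriptions (multiple
network interfaces per node, shared file systems, and so on) beyond this
simple array-of-nodes base case.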

 

Now, as presented above, my straw man proposal looks like suggestions
for changes that might go into a JSDL-1.1 or JSDL-2.0 specification.  In
the near-term, the HPC profile working group will be exploring what can
be done with just JSDL-1.0 and restrictions to that specification.  The
restrictions would correspond to disallowing those parts of the JSDL-1.0
specification that the above proposal advocates moving to extension
profiles.  It will also explore whether a restricted version of the
posix extension could be used to cover most common Windows cases.

 

 

Marvin.
