[ogsa-wg] Paper proposing "evolutionary vertical design efforts"

Marvin Theimer theimer at microsoft.com
Mon Mar 13 16:42:27 CST 2006


Hi;

 

Ian, you are correct that I view job submission to a cluster as being
one of the simplest, and hence most basic, HPC use cases to start with.
Or, to be slightly more general, I view job submission to a "black box"
that can run jobs - be it a cluster or an SMP or an SGI NUMA machine or
what-have-you - as being the simplest and hence most basic HPC use case
to start with.  The key distinction for me is that the internals of the
"box" are for the most part not visible to the client, at least as far
as submitting and running compute jobs is concerned.  There may well be
a separate interface for dealing with things like system management, but
I want to explicitly separate those things out in order to allow for use
of "boxes" that might be managed by proprietary means or by means
obeying standards that a particular job submission client is unfamiliar
with.

 

I think the use case that Ravi Subramaniam posted to this mailing list
back on 2/17 is a good one to start a discussion around.  However, I'd
like to present it from a different point-of-view than he did.  The
manner in which the use case is currently presented emphasizes all the
capabilities and services needed to handle the fully general case of
submitting a batch job to a computing utility/service.  That's a great
way of producing a taxonomy against which any given system or design can
be compared to see what it has to offer.  I would argue that the next
step is to ask what's the simplest subset that represents a useful
system/design and how should one categorize the various capabilities and
services he has identified so as to arrive at meaningful components that
can be selectively used to obtain progressively more capable systems.

 

Another useful exercise is to examine existing job scheduling systems
in order to understand what they provide.  Since in the real world we
will have to deal with the legacy of existing systems, it will be
important to understand how they relate to the use cases we explore.
In the same vein, it will be important to take into account and
understand other existing infrastructures that people use that are
related to HPC use cases.  I'm thinking of things like security
infrastructures, directory services, and so forth.  From the
point-of-view of managing complexity and reducing
total-cost-of-ownership, it will be important to understand the extent
to which existing infrastructure and services can be reused rather than
reinvented.

 

To kick off a discussion around the topic of a minimalist HPC use case,
I present a straw-man description of one below, followed by a first
attempt at categorizing various areas of extension.  The categorization
of extension areas is not meant to be complete or even all that
carefully thought-out as far as componentization boundaries are
concerned; it is merely meant to be a first contribution to get the
discussion going.

 

A basic HPC use case: Compute cluster embedded within an organization.

*        This is your basic batch job scheduling scenario.  Only a very
basic state transition diagram is visible to the client, with the
following states for a job: queued, running, finished.  Additional
states -- and associated state transition request operations and
functionality -- are not supported.  Examples of additional states and
associated functionality include suspension of jobs and migration of
jobs.
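The three-state model above can be sketched as a tiny state machine.  The states and legal transitions come from the bullet itself; the class and function names are purely illustrative (a sketch, not any actual scheduler API):

```python
# Minimal job lifecycle: queued -> running -> finished.
# Suspension and migration states are deliberately absent, per the
# minimal use case; cancellation of a queued job is modeled as going
# directly to "finished".

QUEUED, RUNNING, FINISHED = "queued", "running", "finished"

# Legal transitions in the minimal model.
TRANSITIONS = {
    QUEUED: {RUNNING, FINISHED},   # FINISHED here covers cancellation
    RUNNING: {FINISHED},
    FINISHED: set(),               # terminal state
}

class Job:
    def __init__(self, job_id):
        self.job_id = job_id
        self.state = QUEUED

    def advance(self, new_state):
        """Move to new_state, rejecting transitions outside the diagram."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
```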

*        Only "standard" resources can be described, for example: number
of cpus/nodes needed, memory requirements, disk requirements, etc.
(think resources that are describable by JSDL).
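As a sketch of what such a "standard" resource request might look like, the following mirrors the kinds of quantities JSDL can describe; the class and field names are illustrative assumptions, not actual JSDL element names:

```python
from dataclasses import dataclass
from typing import Optional

# A "standard" resource request of the kind JSDL can describe.
# Field names are illustrative only, not JSDL's actual elements.
@dataclass
class ResourceRequest:
    cpu_count: int = 1               # number of CPUs/nodes needed
    memory_mb: Optional[int] = None  # memory requirement, if stated
    disk_mb: Optional[int] = None    # disk requirement, if stated

    def fits(self, avail_cpus, avail_memory_mb, avail_disk_mb):
        """Check whether a node's free resources satisfy this request."""
        return (self.cpu_count <= avail_cpus
                and (self.memory_mb is None or self.memory_mb <= avail_memory_mb)
                and (self.disk_mb is None or self.disk_mb <= avail_disk_mb))
```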

*        Once a job has been submitted it can be cancelled, but its
resource requests can't be modified.

*        A distributed file system is accessible from client desktop
machines and client file servers, as well as compute nodes of the
compute cluster.  This implies that no data staging is required, that
programs can be (for the most part) executed from existing file system
locations, and that no program "provisioning" is required (since you can
execute them from wherever they are already installed).  Thus in this
use case all data transfer and program installation operations are the
responsibility of the user.

*        Users already have accounts within the existing security
infrastructure (e.g. Kerberos).  They would like to use these and not
have to create/manage additional authentication/authorization
credentials (at least at the level that is visible to them).

*        The job scheduling service resides at a well-known network name
and it is aware of the compute cluster and its resources by "private"
means (e.g. it runs on the head node of the cluster and employs private
means to monitor and control the resources of the cluster).  This
implies that there is no need for any sort of directory services for
finding the compute cluster or the resources it represents other than
basic DNS.

*        Compute cluster system management is opaque to users and is the
concern of the compute cluster's owners.  This implies that system
management is not part of the compute cluster's public job scheduling
interface.  This also implies that there is no need for a logging
interface to the service.  I assume that application-level logging can
be done by means of libraries that write to client files; i.e. that
there is no need for any sort of special system support for logging.

*        Polling is the simplest form of interface to something like a
job scheduling service.  However, a simple call-back notification
interface is a very useful addition that can provide substantial
performance benefits, since it avoids a great deal of unnecessary
network traffic: only job state changes result in notification
messages.
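The difference between the two interface styles can be illustrated with a stub scheduler; all names here are assumptions for illustration, not any real scheduling API:

```python
# Polling vs. callback notification: a client either queries repeatedly
# (a round trip per query, whether or not anything changed), or
# registers a callback that the service invokes only on state changes.

class SchedulerStub:
    def __init__(self):
        self._state = {}
        self._callbacks = {}

    def submit(self, job_id):
        self._state[job_id] = "queued"

    # Polling interface: every call costs a round trip.
    def query_state(self, job_id):
        return self._state[job_id]

    # Callback interface: traffic only when state actually changes.
    def register_callback(self, job_id, fn):
        self._callbacks[job_id] = fn

    def _set_state(self, job_id, state):
        """Service-internal transition; fires the notification, if any."""
        self._state[job_id] = state
        if job_id in self._callbacks:
            self._callbacks[job_id](job_id, state)
```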

*        There is no notion of fault tolerance.  Jobs that fail must
be resubmitted by the client.  Neither the cluster head node nor its
compute nodes are fault tolerant.  I do expect the client software to
return an indication of failure-due-system-fault when appropriate.
(Note that this may also occur when things like network partitions
occur.)

*        One does need some notion of how to deal with orphaned
resources and jobs.  The notion of job lifetime and post-expiration
garbage collection is a natural approach here.
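One natural realization of job lifetime plus post-expiration garbage collection is a lease: each job carries an expiration time that a live client can renew, and a periodic sweep reclaims whatever has lapsed.  The sketch below assumes this lease policy; all names are invented for illustration:

```python
import time

# Lifetime-based garbage collection for orphaned jobs: each job carries
# an expiration time; a periodic sweep reclaims anything past it.
class JobTable:
    def __init__(self, default_lifetime_s=3600):
        self.default_lifetime_s = default_lifetime_s
        self._expiry = {}

    def add(self, job_id, now=None, lifetime_s=None):
        now = time.time() if now is None else now
        self._expiry[job_id] = now + (lifetime_s or self.default_lifetime_s)

    def renew(self, job_id, now=None):
        """A live client extends its job's lease by touching it."""
        now = time.time() if now is None else now
        self._expiry[job_id] = now + self.default_lifetime_s

    def sweep(self, now=None):
        """Garbage-collect expired (orphaned) jobs; return their ids."""
        now = time.time() if now is None else now
        dead = [j for j, t in self._expiry.items() if t <= now]
        for j in dead:
            del self._expiry[j]
        return dead
```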

*        The scheduling service provides a fixed set of scheduling
policies, with only a few basic choices (or maybe even just one), such
as FIFO or round-robin.  There is no notion, in general, of SLAs (which
are a form of scheduling policy).

*        Enough information must be returned to the client when a job
finishes to enable basic accounting functionality.  This means things
like total wall-clock time the job ran and a summary of resources used.
There is not a need for the interface to support any sort of grouping of
accounting information.  That is, jobs do not need to be associated with
projects, groups, or other accounting entities and the job scheduling
service is not responsible for tracking accounting information across
such entities.  As long as basic resource utilization information is
returnable for each job, accounting can be done externally to the job
scheduling service.  I do assume that jobs can be uniquely identified by
some means and can be uniquely associated with some principal entity
existing in the overall system, such as a user name.
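A minimal per-job usage record of the kind described above might look like the following sketch.  Field names are illustrative; the aggregation function shows how accounting by principal can be done entirely outside the scheduling service, given only a unique job id and an owner:

```python
from dataclasses import dataclass

# Per-job usage summary returned at job completion.  With a unique job
# id and an owning principal, aggregation by project or group can be
# done entirely outside the scheduling service.
@dataclass(frozen=True)
class JobUsageRecord:
    job_id: str          # unique job identifier
    owner: str           # principal (e.g. user name) the job ran as
    wallclock_s: float   # total wall-clock time the job ran
    cpu_s: float         # aggregate CPU time across nodes
    max_memory_mb: int   # peak memory footprint observed

def total_wallclock_by_owner(records):
    """External accounting: per-user totals, no service support needed."""
    totals = {}
    for r in records:
        totals[r.owner] = totals.get(r.owner, 0.0) + r.wallclock_s
    return totals
```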

*        Just as there is no notion of requiring the job scheduling
service to track any but the most basic job-level accounting
information, there is no notion of the service enforcing quotas on jobs.

*        Although it is generally useful to separate the notions of
resource reservation from resource usage (e.g. to enable interactive and
debugging use of resources), it is not a necessity for the most basic of
job scheduling services.  

*        There is no notion of tying multiple jobs together, either to
support things like dependency graphs or to support things like
workflows.  Such capabilities must be implemented by clients of the job
scheduling service.
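Client-side chaining of dependent jobs against such a minimal service can be sketched as follows; the submit/wait functions stand in for whatever client API exists and are purely hypothetical:

```python
# Client-side job chaining: since the minimal service knows nothing
# about dependencies, the client submits each job only after its
# predecessor finishes, and abandons the chain on the first failure.

def run_chain(submit, wait_for_finish, job_specs):
    """Run job_specs in order; stop at the first failure.

    submit(spec) -> job_id; wait_for_finish(job_id) -> True on success.
    Returns the list of successfully completed job ids.
    """
    done = []
    for spec in job_specs:
        job_id = submit(spec)
        if not wait_for_finish(job_id):
            break                 # dependency failed; skip the rest
        done.append(job_id)
    return done
```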

 

Interesting extension areas:

*                     Additional scheduling policies

o        Weighted fair-share, ...

o        Multiple queues

o        SLAs

o        ...

*                     Extended resource descriptions

o        Additional resource types, such as GPUs

o        Additional types of compute resources, such as desktop
computers

o        Condor-style class ads

*                     Extended job descriptions (as returned to
requesting clients and sys admins)

*                     Additional classes of security credentials

*                     Reservations separated from execution

o        Enabling interactive and debugging jobs

o        Support for multiple competing schedulers (incl. desktop cycle
stealing and market-based approaches to scheduling compute resources)

*                     Ability to modify jobs during their existence

*                     Fault tolerance

o        Automatic rescheduling of jobs that failed due to system faults

o        Highly available resources:  This is partly a policy statement
by a scheduling service about its characteristics and partly the ability
to rebind clients to migrated service endpoints

*                     Extended state transition diagrams and associated
functionalities

o        Job suspension

o        Job migration

o        ...

*                     Accounting & quotas

*                     Operating on arrays of jobs

*                     Meta-schedulers, multiple schedulers, and
ecologies and hierarchies of multiple schedulers

o        Meta-schedulers

*         Hierarchical job scheduling with a meta-scheduler as the only
entry point; forwarding jobs to the meta-scheduler from other subsidiary
schedulers

o        Condor-style matchmaking

*                     Directory services

o        Using existing directory services

o        Abstract directory service interface(s)

*                     Data transfer topics

o        Application data staging

*         Naming

*         Efficiency

*         Convenience

*         Cleanup

o        Program staging/provisioning

*         Description

*         Installation

*         Cleanup

 

 

Marvin.

 

________________________________

From: Ian Foster [mailto:foster at mcs.anl.gov] 
Sent: Monday, February 20, 2006 9:20 AM
To: Marvin Theimer; ogsa-wg at ggf.org
Cc: Marvin Theimer; Savas Parastatidis; Tony Hey; Marty Humphrey;
gcf at grids.ucs.indiana.edu
Subject: Re: [ogsa-wg] Paper proposing "evolutionary vertical design
efforts"

 

Dear All:

The most important thing to understand at this point (IMHO) is the scope
of this "HPC use case," as this will determine just how minimal we can
be.

I get the impression that the principal goal may be "job submission to a
cluster." Is that correct? How do we start to circumscribe the scope
more explicitly?

Ian.



At 05:45 AM 2/16/2006 -0800, Marvin Theimer wrote:




Enclosed is a paper that advocates an additional set of activities that
the authors believe the OGSA working groups should engage in.

 

Broadly speaking, the OGSA and related working groups are already doing
a bunch of important things:

*         There is broad exploration of the big picture, including
enumeration of use cases, taxonomy of areas, identification of research
issues, etc.

*         There is work going on in each of the horizontal areas that
have been identified, such as EMS, data services, etc.

*         There is work going on around individual specifications, such
as BES, JSDL, etc.

 

Given that individual specifications are beginning to come to fruition,
the authors believe it is time to also start defining vertical profiles
that precisely describe how groups of individual specifications should
be employed to implement specific use cases in an interoperable manner.
The authors also believe that the process of defining these profiles
offers an opportunity to close the design loop by relating the various
on-going protocol and standards efforts back to the use cases in a very
concrete manner.  This provides an end-to-end setting in which to
identify holes and issues that might require additional protocols
and/or (incremental) changes to existing protocols.  The paper
introduces the general notion of doing focused vertical design efforts
and then focuses on a specific vertical design effort, namely a minimal
HPC design.

 

The paper derives a specific HPC design in a first-principles manner
since the authors believe that this increases the chances of identifying
issues.  As a consequence, existing specifications and the activities of
existing working groups are not mentioned and this paper is not an
attempt to actually define a specifications profile.  Also, the absence
of references to existing work is not meant to imply that such work is
in any way irrelevant or inappropriate.  The paper should be viewed as a
first abstract attempt to propose a new kind of activity within OGSA.
The expectation is that future open discussions and publications will
explore the concrete details of such a proposal.

 

This paper was recently sent to a few key individuals in order to get
feedback from them before submitting it to the wider GGF community.
Unfortunately that process took longer than intended, and some members
of the community may have already seen a copy of the paper without
knowing the context within which it was written.  This email should
hopefully dispel any misconceptions that may have arisen.

 

For those people who will be around for the F2F meetings on Friday,
Marvin Theimer will be giving a talk on the contents of this paper at a
time and place to be announced.

 

Marvin Theimer, Savas Parastatidis, Tony Hey, Marty Humphrey, Geoffrey
Fox

 

_______________________________________________________________
Ian Foster                    www.mcs.anl.gov/~foster
Math & Computer Science Div.  Dept of Computer Science
Argonne National Laboratory   The University of Chicago    
Argonne, IL 60439, U.S.A.     Chicago, IL 60637, U.S.A.
Tel: 630 252 4619             Fax: 630 252 1997
        Globus Alliance, www.globus.org


