[ogsa-wg] Paper proposing "evolutionary vertical design efforts"

Tue Mar 14 08:31:47 CST 2006

Carl,

Your comments are very important. We would love to have your active
participation in this effort. Your experience is, of course, matched by few!

I re-emphasize that this represents (my words, not anyone else's) "baby
steps" that are necessary and important for the Grid community.  In my
opinion, the biggest challenge will be to fight the urge to expand the scope
beyond a small size. You cannot ignore the possibility that the GGF has NOT
made as much progress as it should have to date. Furthermore, one plausible
explanation is that the scope is too grand.

-- Marty

  _____  

From: owner-ogsa-wg at ggf.org [mailto:owner-ogsa-wg at ggf.org] On Behalf Of Carl
Kesselman
Sent: Tuesday, March 14, 2006 8:47 AM
To: Marvin Theimer; Ian Foster; ogsa-wg at ggf.org
Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical design
efforts"

Hi,

While I have no wish to engage in the "what is a Grid" argument, there are
some elements of your base use case that I would be concerned about.
Specifically, the assumption that the submission is into a "local cluster"
on which there is an existing account may lead one to a solution that does
not generalize to the case of submission across autonomous policy domains.
I would also argue that ignoring issues of fault tolerance from the
beginning is problematic.  One must at least design operations that are
restartable (for example, at-most-once submission semantics).
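
To make the at-most-once point concrete, here is a minimal Python sketch.
It is an illustration only, not taken from any existing system: the names
JobService, submit_with_retry, and request_id are hypothetical, and the
service is a toy in-memory stand-in.  The idea shown is that the client
picks one request id and reuses it on every retry, so a resent submission
never creates a second job.

```python
import uuid


class JobService:
    """Toy in-memory job service; every name here is hypothetical."""

    def __init__(self):
        self._jobs_by_request = {}  # client request id -> job id
        self._next_job = 0

    def submit(self, request_id, command):
        # A retried request carrying a known id returns the original job
        # instead of creating a second one.
        if request_id in self._jobs_by_request:
            return self._jobs_by_request[request_id]
        self._next_job += 1
        job_id = f"job-{self._next_job}"
        self._jobs_by_request[request_id] = job_id
        return job_id


def submit_with_retry(service, command, attempts=3):
    """Retry a submission safely by reusing one client-chosen request id."""
    request_id = str(uuid.uuid4())  # chosen once, reused on every retry
    last_error = None
    for _ in range(attempts):
        try:
            return service.submit(request_id, command)
        except ConnectionError as exc:
            last_error = exc  # reply may have been lost; safe to resend
    raise last_error
```

With this shape a client that loses the reply to its first attempt can
simply resend, and the service still runs the job at most once.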

I would finally suggest that while examining existing job scheduling systems
is a good thing to do, we should also examine existing remote submission
systems (dare I say Grid systems).  The basic HPC use case is one in which
there is a significant amount of implementation and usage experience.

Thanks,

Carl

  _____  

From: owner-ogsa-wg at ggf.org [mailto:owner-ogsa-wg at ggf.org] On Behalf Of
Marvin Theimer
Sent: Monday, March 13, 2006 2:42 PM
To: Ian Foster; ogsa-wg at ggf.org
Cc: Marvin Theimer
Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical design
efforts"

Hi;

Ian, you are correct that I view job submission to a cluster as being one of
the simplest, and hence most basic, HPC use cases to start with.  Or, to be
slightly more general, I view job submission to a "black box" that can run
jobs - be it a cluster or an SMP or an SGI NUMA machine or what-have-you -
as being the simplest and hence most basic HPC use case to start with.  The
key distinction for me is that the internals of the "box" are for the most
part not visible to the client, at least as far as submitting and running
compute jobs is concerned.  There may well be a separate interface for
dealing with things like system management, but I want to explicitly
separate those things out in order to allow for use of "boxes" that might be
managed by proprietary means or by means obeying standards that a particular
job submission client is unfamiliar with.

I think the use case that Ravi Subramaniam posted to this mailing list back
on 2/17 is a good one to start a discussion around.  However, I'd like to
present it from a different point-of-view than he did.  The manner in which
the use case is currently presented emphasizes all the capabilities and
services needed to handle the fully general case of submitting a batch job
to a computing utility/service.  That's a great way of producing a taxonomy
against which any given system or design can be compared to see what it has
to offer.  I would argue that the next step is to ask what's the simplest
subset that represents a useful system/design and how should one categorize
the various capabilities and services he has identified so as to arrive at
meaningful components that can be selectively used to obtain progressively
more capable systems.

Another useful exercise to do is to examine existing job scheduling systems
in order to understand what they provide.  Since in the real world we will
have to deal with the legacy of existing systems it will be important to
understand how they relate to the use cases we explore.  In the same vein,
it will be important to take into account and understand other existing
infrastructures that people use that are related to HPC use cases.  I'm
thinking of things like security infrastructures, directory services, and so
forth.  From the point-of-view of managing complexity and reducing
total-cost-of-ownership, it will be important to understand the extent to
which existing infrastructure and services can be reused rather than
reinvented.

To kick off a discussion around the topic of a minimalist HPC use case, I
present a straw man description of such below and then present a first
attempt at categorizing various areas of extension.  The categorization of
extension areas is not meant to be complete or even all that carefully
thought-out as far as componentization boundaries are concerned; it is
merely meant to be a first contribution to get the discussion going.

A basic HPC use case: Compute cluster embedded within an organization.

*     This is your basic batch job scheduling scenario.  Only a very basic
state transition diagram is visible to the client, with the following states
for a job: queued, running, finished.  Additional states -- and associated
state transition request operations and functionality -- are not supported.
Examples of additional states and associated functionality include
suspension of jobs and migration of jobs.

*     Only "standard" resources can be described, for example: number of
cpus/nodes needed, memory requirements, disk requirements, etc.  (think
resources that are describable by JSDL).

*     Once a job has been submitted it can be cancelled, but its resource
requests can't be modified.

*     A distributed file system is accessible from client desktop machines
and client file servers, as well as compute nodes of the compute cluster.
This implies that no data staging is required, that programs can be (for the
most part) executed from existing file system locations, and that no program
"provisioning" is required (since you can execute them from wherever they
are already installed).  Thus in this use case all data transfer and program
installation operations are the responsibility of the user.

*     Users already have accounts within the existing security
infrastructure (e.g. Kerberos).  They would like to use these and not have
to create/manage additional authentication/authorization credentials (at
least at the level that is visible to them).

*     The job scheduling service resides at a well-known network name and it
is aware of the compute cluster and its resources by "private" means (e.g.
it runs on the head node of the cluster and employs private means to monitor
and control the resources of the cluster).  This implies that there is no
need for any sort of directory services for finding the compute cluster or
the resources it represents other than basic DNS.

*     Compute cluster system management is opaque to users and is the
concern of the compute cluster's owners.  This implies that system
management is not part of the compute cluster's public job scheduling
interface.  This also implies that there is no need for a logging interface
to the service.  I assume that application-level logging can be done by
means of libraries that write to client files; i.e. that there is no need
for any sort of special system support for logging.

*     A simple polling-based interface is the simplest form of interface to
something like a job scheduling service.  However, a simple call-back
notification interface is a very useful addition that potentially provides
substantial performance benefits since it can enable the avoidance of lots
of unnecessary network traffic.  Only job state changes result in
notification messages.

*     There are no notions of fault tolerance.  Jobs that fail must be
resubmitted by the client.  Neither the cluster head node nor its compute
nodes are fault tolerant.  I do expect the client software to return an
indication of failure-due-to-system-fault when appropriate.  (Note that this
may also occur when things like network partitions occur.)

*     One does need some notion of how to deal with orphaned resources and
jobs.  The notion of job lifetime and post-expiration garbage collection is
a natural approach here.

*     The scheduling service provides a fixed set of scheduling policies,
with only a few basic choices (or maybe even just one), such as FIFO or
round-robin.  There is no notion, in general, of SLAs (which are a form of
scheduling policy).

*     Enough information must be returned to the client when a job finishes
to enable basic accounting functionality.  This means things like total
wall-clock time the job ran and a summary of resources used.  There is not a
need for the interface to support any sort of grouping of accounting
information.  That is, jobs do not need to be associated with projects,
groups, or other accounting entities and the job scheduling service is not
responsible for tracking accounting information across such entities.  As
long as basic resource utilization information is returnable for each job,
accounting can be done externally to the job scheduling service.  I do
assume that jobs can be uniquely identified by some means and can be
uniquely associated with some principal entity existing in the overall
system, such as a user name.

*     Just as there is no notion of requiring the job scheduling service to
track any but the most basic job-level accounting information, there is no
notion of the service enforcing quotas on jobs.

*     Although it is generally useful to separate the notions of resource
reservation from resource usage (e.g. to enable interactive and debugging
use of resources), it is not a necessity for the most basic of job
scheduling services.  

*     There is no notion of tying multiple jobs together, either to support
things like dependency graphs or to support things like workflows.  Such
capabilities must be implemented by clients of the job scheduling service.
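
The straw-man use case above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not a proposed
interface: MinimalScheduler, Job, and all method names are hypothetical.
Cancellation is modelled as a transition to "finished" (an assumption,
since only three states are client-visible), and start/finish stand in for
whatever private means the head node uses to run jobs.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    FINISHED = "finished"


@dataclass
class Job:
    job_id: str
    owner: str                 # principal the job is associated with
    resources: dict            # fixed at submission, e.g. {"cpus": 4}
    expires_at: float          # lifetime bound used for garbage collection
    state: JobState = JobState.QUEUED
    started_at: Optional[float] = None
    finished_at: Optional[float] = None


class MinimalScheduler:
    def __init__(self, clock=time.time):
        self._clock = clock
        self._jobs = {}
        self._listeners = []
        self._count = 0

    def subscribe(self, callback):
        """Register callback(job_id, new_state); fired on state changes only."""
        self._listeners.append(callback)

    def submit(self, owner, resources, lifetime_s=3600.0):
        self._count += 1
        job = Job(f"job-{self._count}", owner, dict(resources),
                  self._clock() + lifetime_s)
        self._jobs[job.job_id] = job
        return job.job_id

    def poll(self, job_id):
        return self._jobs[job_id].state

    def start(self, job_id):
        job = self._jobs[job_id]
        if job.state is not JobState.QUEUED:
            raise ValueError("only queued jobs can start")
        job.started_at = self._clock()
        self._transition(job, JobState.RUNNING)

    def finish(self, job_id):
        """Complete or cancel a job; return a per-job accounting summary."""
        job = self._jobs[job_id]
        if job.state is not JobState.FINISHED:
            job.finished_at = self._clock()
            self._transition(job, JobState.FINISHED)
        ran = job.started_at is not None
        return {
            "job_id": job.job_id,
            "owner": job.owner,
            "wall_clock_s": job.finished_at - job.started_at if ran else 0.0,
            "resources": dict(job.resources),
        }

    cancel = finish  # assumption: cancellation is reported as "finished"

    def collect_expired(self):
        """Drop jobs whose lifetime has passed; return their ids."""
        now = self._clock()
        expired = [j for j, job in self._jobs.items() if job.expires_at <= now]
        for job_id in expired:
            del self._jobs[job_id]
        return expired

    def _transition(self, job, state):
        job.state = state
        for callback in self._listeners:
            callback(job.job_id, state)
```

A client can either call poll() repeatedly or register a callback with
subscribe().  The accounting summary is returned per job, so any grouping
by project or user can be done outside the service.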

Interesting extension areas:

*      Additional scheduling policies

o     Weighted fair-share, ...

o     Multiple queues

o     SLAs

o     ...

*      Extended resource descriptions

o     Additional resource types, such as GPUs

o     Additional types of compute resources, such as desktop computers

o     Condor-style class ads

*      Extended job descriptions (as returned to requesting clients and sys
admins)

*      Additional classes of security credentials

*      Reservations separated from execution

o     Enabling interactive and debugging jobs

o     Support for multiple competing schedulers (incl. desktop cycle
stealing and market-based approaches to scheduling compute resources)

*      Ability to modify jobs during their existence

*      Fault tolerance

o     Automatic rescheduling of jobs that failed due to system faults

o     Highly available resources:  This is partly a policy statement by a
scheduling service about its characteristics and partly the ability to
rebind clients to migrated service endpoints

*      Extended state transition diagrams and associated functionalities

o     Job suspension

o     Job migration

o     ...

*      Accounting & quotas

*      Operating on arrays of jobs

*      Meta-schedulers, multiple schedulers, and ecologies and hierarchies
of multiple schedulers

o     Meta-schedulers

      -     Hierarchical job scheduling with a meta-scheduler as the only
            entry point; forwarding jobs to the meta-scheduler from other
            subsidiary schedulers

o     Condor-style matchmaking

*      Directory services

o     Using existing directory services

o     Abstract directory service interface(s)

*      Data transfer topics

o     Application data staging

      -     Naming

      -     Efficiency

      -     Convenience

      -     Cleanup

o     Program staging/provisioning

      -     Description

      -     Installation

      -     Cleanup

Marvin.

  _____  

From: Ian Foster [mailto:foster at mcs.anl.gov] 
Sent: Monday, February 20, 2006 9:20 AM
To: Marvin Theimer; ogsa-wg at ggf.org
Cc: Marvin Theimer; Savas Parastatidis; Tony Hey; Marty Humphrey;
gcf at grids.ucs.indiana.edu
Subject: Re: [ogsa-wg] Paper proposing "evolutionary vertical design
efforts"

Dear All:

The most important thing to understand at this point (IMHO) is the scope of
this "HPC use case," as this will determine just how minimal we can be.

I get the impression that the principal goal may be "job submission to a
cluster." Is that correct? How do we start to circumscribe the scope more
explicitly?

Ian.

At 05:45 AM 2/16/2006 -0800, Marvin Theimer wrote:

Enclosed is a paper that advocates an additional set of activities that the
authors believe that the OGSA working groups should engage in.

Broadly speaking, the OGSA and related working groups are already doing a
bunch of important things:

*         There is broad exploration of the big picture, including
enumeration of use cases, taxonomy of areas, identification of research
issues, etc.

*         There is work going on in each of the horizontal areas that have
been identified, such as EMS, data services, etc.

*         There is work going on around individual specifications, such as
BES, JSDL, etc.

Given that individual specifications are beginning to come to fruition, the
authors believe it is time to also start defining vertical profiles that
precisely describe how groups of individual specifications should be
employed to implement specific use cases in an interoperable manner.  The
authors also believe that the process of defining these profiles offers an
opportunity to close the design loop by relating the various on-going
protocol and standards efforts back to the use cases in a very concrete
manner.  This provides an end-to-end setting in which to identify holes and
issues that might require additional protocols and/or (incremental) changes
to existing protocols.  The paper introduces both the general notion of
doing focused vertical design efforts and then focuses on a specific vertical
design effort, namely a minimal HPC design.  

The paper derives a specific HPC design in a first principles manner since
the authors believe that this increases the chances of identifying issues.
As a consequence, existing specifications and the activities of existing
working groups are not mentioned and this paper is not an attempt to
actually define a specifications profile.  Also, the absence of references
to existing work is not meant to imply that such work is in any way
irrelevant or inappropriate.  The paper should be viewed as a first abstract
attempt to propose a new kind of activity within OGSA.  The expectation is
that future open discussions and publications will explore the concrete
details of such a proposal.

This paper was recently sent to a few key individuals in order to get
feedback from them before submitting it to the wider GGF community.
Unfortunately that process took longer than intended and some members of the
community may have already seen a copy of the paper without knowing the
context within which it was written.  This email should hopefully dispel any
misconceptions that may have occurred.

For those people who will be around for the F2F meetings on Friday,
Marvin Theimer will be giving a talk on the contents of this paper at a time
and place to be announced.

Marvin Theimer, Savas Parastatidis, Tony Hey, Marty Humphrey, Geoffrey Fox

_______________________________________________________________
Ian Foster                    www.mcs.anl.gov/~foster
Math & Computer Science Div.  Dept of Computer Science
Argonne National Laboratory   The University of Chicago    
Argonne, IL 60439, U.S.A.     Chicago, IL 60637, U.S.A.
Tel: 630 252 4619             Fax: 630 252 1997
        Globus Alliance, www.globus.org
