[ogsa-wg] Paper proposing "evolutionary vertical design efforts"

Marty Humphrey humphrey at cs.virginia.edu
Tue Mar 21 12:43:44 CST 2006


But this is not so simple. The knee-jerk reaction is to separate these two
concerns into implementation vs. interface, and develop each one
independently. But taken to the extreme, a system that "appears" to be rich
in its capabilities might not be so in reality for some time (if EVER!). 

 

Let's assume that we truly separate these concerns and build "sophisticated
interfaces." But then what about the potential consumer of such services?
Building an overly complex interface to such a service (without any
practical implementations behind it) might promote correspondingly
complicated clients (which in turn promotes further complexity upstream).
"Build the interface and they will come with implementations" is a variation
on a theme that doesn't always come true. Arguably, complexity is what we're
trying to get away from.

 

And no, I'm not advocating only an interface that matches existing
capabilities. I'm just saying that it's NOT obvious that the most effective
approach is to entirely decouple these two concerns.

 

-- Marty

 

  _____  

From: Ian Foster [mailto:foster at mcs.anl.gov] 
Sent: Tuesday, March 21, 2006 1:34 PM
To: Marvin Theimer; Carl Kesselman
Cc: humphrey at cs.virginia.edu; ogsa-wg at ggf.org; Marvin Theimer
Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical design
efforts"

 

Marvin:

I think you are mixing two things together: the capabilities of the
scheduler and the capabilities of the remote submission interface. The
proposal that we support at-most-once submission capabilities is a proposal
for capabilities in the remote submission interface, not the scheduler. I
wouldn't expect existing schedulers to provide this capability, just as they
don't (for the most part) support Web Services interfaces. But once we
define a Web Services-based remote submission interface, at-most-once
submission capabilities become important.

Ian.


At 10:28 AM 3/21/2006 -0800, Marvin Theimer wrote:




Hi;

 

While I agree with you that at-most-once semantics are very desirable, I
would like to point out that not all existing job schedulers implement them.
I know that both LSF and CCS (the Microsoft HPC job scheduler) don't.  I've
been trying to find out whether PBS and SGE do or don't.  

 

So, this brings up the following slightly more general question: should the
simplest base case be the simplest case that does something useful, or
should it be more complicated than that?  I can see good arguments on both
sides:

*         Whittling things down to the simplest possible base case maximizes
the likelihood that parties can participate.  Every feature added represents
one more feature that some existing system may not be able to support or
that a new system has to provide even when it's not needed in the context of
that system.  Suppose, for example, that PBS and SGE don't provide
transactional semantics of the type you described.  Then 4 of the 6 most
common job scheduling systems would not have this feature and would need to
somehow add it to their implementations.  In this particular case it might
be too difficult to add in practice, but in general there might be problems.


*         On the other hand, since there are many clients and arguably far
fewer server implementations, features that substantially simplify client
behavior/programming and that are not too onerous to implement in existing
and future systems should be part of the base case.  The problem, of course,
is that this is a slippery slope at the end of which lies the number 42
(ignore that last phrase if you're not a fan of The Hitchhiker's Guide to the
Galaxy).

 

Personally, the slippery slope argument makes me lean towards defining the
simplest possible base use case, since otherwise we'll spend a (potentially
very) long time arguing about which features are important enough to justify
being in the base case.  One possible way forward on this issue is to have
people come up with lists of features that they feel belong in the base use
case and then we agree to include only those that have a large majority of
the community arguing for their inclusion in the base case.  

 

Unfortunately, defining what "large majority" should mean is also not easy or
obvious.  Indeed, one can argue that we can't even afford to let all votes be
equal.  Consider the following hypothetical (and contrived) case: 100
members of a particular academic research community show up and vote that
the base case must include support for a particular complicated scheduling
policy, and the less-than-ten suppliers of existing job schedulers with
significant numbers of users all vote against it.  Should it be included in
the base case?  What happens if the major scheduler vendors/suppliers decide
that they can't justify implementing it, and therefore can't be GGF
spec-compliant, and therefore go off and define their own job scheduling
standard?  The hidden issue is, of course, whether those voting are
representative of the overall HPC user population.  I can't personally answer
that question, but it does again lead me to want to minimize the number of
times I have to ask that question, i.e., the number of features that I have to
consider for inclusion in the base case.

 

So this brings me to the question of next steps.  Recall that the approach
I'm advocating (and that others have bought in to, as far as I can tell) is
that we define a base case and the mechanisms and approach to how extensions
of the base case are done.  I assert that the absolutely most important part
of defining how extension should work is ensuring that multiple extensions
don't end up producing a hairball that's impossible to understand, implement,
or use.  In practice this means coming up with a restricted form of extension,
since history is pretty clear on the pitfalls of trying to support
arbitrarily general extension schemes.  

 

This is one of the places where identification of common use cases comes in.
If we define the use cases that we think might actually occur then we can
ask whether a given approach to extension has a plausible way of achieving
all the identified use cases.  Of course, future desired use cases might not
be achievable by the extension schemes we come up with now, but that
possibility is inevitable given anything less than a fully general extension
scheme.  Indeed, even among the common use cases we identify now, we might
discover that there are trade-offs where a simpler (and hence probably more
understandable and easier to implement and use) extension scheme can cover
80% of the use cases while a much more complicated scheme is required to
cover 100% of the use cases.

 

Given all this, here are the concrete next steps I'd like to propose:

*         Everyone who is participating in this design effort should define
what they feel should be the HPC base use case.  This represents the
simplest use case, and associated features (like transactional submit
semantics), that you feel everyone in the HPC grid world must implement.  We
will take these use case candidates and debate which one to actually settle
on.

*         Everyone should define the set of HPC use cases that they believe
might actually occur in practice.  I will refer to these as the common use
cases, in contrast to the base use case.  The goal here is not to define the
most general HPC use case, but rather the more restricted use cases that
might occur in real life.  For example, not all systems will support job
migration, so whereas a fully general HPC use case would include the notion
of job migration, I argue that one or more common use cases will not include
job migration.

Everyone should also prioritize and rank their common use cases, so that we
can discuss 80/20-style trade-offs concerning which use cases to support
with any given approach to extension.  This prioritization should include
the notion of how common you think a use case will actually be, and hence
how important it will be to actually support that use case.

*         Everyone should start thinking about what kinds of extension
approaches they believe we should define, given the base use case and common
use cases that they have identified.

 

As multiple people have pointed out, an exploration of common HPC use cases
has already been done one or several times before, including in the EMS
working group.  I'm still catching up on reading GGF documents, so I don't
know how much those prior efforts explored the issue from the point-of-view
of base case plus extensions.  If these prior explorations did address the
topic of base-plus-extensions, and you agree with the specifics that were
arrived at, then this exercise will be a quick-and-easy one for you: you can
simply publish the appropriate links to prior material in an email to this
mailing list.  I will personally be sending in my list independent of prior
efforts in order to provide a newcomer's perspective on the subject.  It will
be interesting to see how much overlap there is.

 

One very important point that I'd like to raise is the following: time is
short, and "best" is the enemy of "good enough".  Microsoft is planning to provide
a Web services-based interoperability interface to its job scheduler
sometime in the next year or two.  I know that many of the other job
scheduler vendors/suppliers are also interested in having an
interoperability story in place sooner rather than later.  To meet this
schedule on the Microsoft side will require locking down a first fairly
complete draft of whatever design will be shipped by essentially the end of
August.  That's so that we can do all the necessary debugging,
interoperability testing, security threat modeling, etc. that goes with
shipping an actual finished product.  What that means for the HPC profile
work is that, come the end of August, Microsoft and possibly other scheduler
vendors/suppliers will need to lock down and start coding some version of
Web Services-based job scheduling and data transfer protocols.  If there is
a fairly well-defined, feasible set of specs/profile coming out of the GGF
HPC working group (for recommendation NOT yet for actual standards approval)
that has some reasonable level of consensus by then, then that's what
Microsoft will very likely go with.  Otherwise Microsoft will need to defer
the idea of shipping anything that might be GGF compliant to version 3 of
our product, which will probably ship about 4 years from now.

 

The chances of coming up with the "best" HPC profile by the end of August are
slim.  The chances of coming up with a fairly simple design that is "good
enough" to cover the most important common cases, by means of a relatively
simple, restricted form of extension, seem much more feasible.  Covering a
richer set of use cases would need to be deferred to a future version of the
profile, much in the manner that BES has been defined to cover an important
sub-category of use cases now, with a fuller EMS design being done in
parallel as future work.  So I would argue that perhaps the most important
thing that this design effort, and the planned HPC profile working group that
will be set up in Tokyo, can do is to identify what a "good enough" version 1
HPC profile should be.

 

Marvin.

 

 

  _____  

From: Carl Kesselman [mailto:carl at isi.edu] 
Sent: Thursday, March 16, 2006 12:49 AM
To: Marvin Theimer
Cc: humphrey at cs.virginia.edu; ogsa-wg at ggf.org
Subject: Re: [ogsa-wg] Paper proposing "evolutionary vertical design
efforts"

 

Hi,

In the interest of furthering agreement, I was not arguing that the
application had to be restartable. Rather, what has been shown to be
important is that the protocol be restartable, in the following sense: if
you submit a job and the far-end server fails, is the job running or not? If
you resubmit, do you get another job instance? The GT submission protocol
and Condor both have transactional semantics, so that you can have at-most-once
submit semantics regardless of client and server failures. The fact that
your application may be non-idempotent is exactly why having well-defined
semantics in this case is important.
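The at-most-once behavior described above can be sketched roughly as follows. This is a toy illustration, not the actual GT or Condor protocol, and all names in it are invented: the client generates a unique submission ID up front and reuses it across retries, and the server treats a duplicate ID as the same submission rather than a new job instance.

```python
import uuid


class SchedulerStub:
    """Toy server: deduplicates submissions by client-supplied ID."""

    def __init__(self):
        self.jobs = {}  # submission_id -> job record

    def submit(self, submission_id, job_spec):
        # A resubmission carrying the same ID returns the existing job
        # instead of starting a second instance.
        if submission_id not in self.jobs:
            self.jobs[submission_id] = {"spec": job_spec, "state": "queued"}
        return self.jobs[submission_id]


def submit_at_most_once(scheduler, job_spec, retries=3):
    """Client side: one ID per logical submission, reused across retries."""
    submission_id = str(uuid.uuid4())
    for _ in range(retries):
        try:
            return scheduler.submit(submission_id, job_spec)
        except ConnectionError:
            continue  # safe to retry: the server deduplicates on the ID
    raise RuntimeError("submission failed after retries")
```

The point is that even if the server's reply is lost and the client resubmits, at most one job instance is ever created, regardless of which side failed.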

So what is the next step?

Carl

Dr. Carl Kesselman                              email:   carl at isi.edu
USC/Information Sciences Institute        WWW: http://www.isi.edu/~carl
4676 Admiralty Way, Suite 1001           Phone:  (310) 448-9338
Marina del Rey, CA 90292-6695            Fax:      (310) 823-6714



-----Original Message-----
From: Marvin Theimer <theimer at microsoft.com>
To: Carl Kesselman <carl at isi.edu>
CC: Marvin Theimer <theimer at microsoft.com>; Marty Humphrey
<humphrey at cs.virginia.edu>; ogsa-wg at ggf.org <ogsa-wg at ggf.org>
Sent: Wed Mar 15 14:26:36 2006
Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical design
efforts"

Hi;



I suspect that we're mostly in agreement on things.  In particular, I think
your list of four core aspects is a great starting point for a discussion on
the topic.



I just replied to an earlier email from Ravi with a description of what I'm
hoping to get out of examining various HPC use cases:

.        Identification of the simplest base case that everyone will have to
implement.

.        Identification of common cases we want to optimize.

.        Identification of how evolution and selective extension will work.



I totally agree with you that the base use case I described isn't really a
"grid" use case.  But it is an HPC use case; in fact, it is arguably the most
common use case in current existence. :-)  So I think it's important that we
understand how to seamlessly integrate and support that common and very
simple use case.



I also totally agree with you that we can't let a solution to the simplest
HPC use case paint us into a corner that prevents supporting the richer use
cases that grid computing is all about.  That's why I'd like to spend
significant effort exploring and understanding the issues of how to support
evolution and selective extension.  In an ideal world, a legacy compute
cluster job scheduler could have a simple "grid shim" that let it participate
at a basic level, in a natural manner, in a grid environment, while smarter
clients and HPC services could interoperate with each other in various
selectively richer manners by means of extensions to the basic HPC grid
design.



One place where I disagree with you is your assertion that everything needs
to be designed to be restartable.  While that's a good goal to pursue, I'm not
convinced that you can achieve it in all cases.  In particular, there are at
least two cases that I claim we want to support that aren't restartable:

.        We want to be able to run applications that aren't restartable; for
example, because they perform non-idempotent operations on the external
physical environment.  If such an application fails during execution then
the only one who can figure out what the proper next steps are is the end
user.

.        We want to be able to include (often-times legacy) systems that
aren't fault tolerant, such as simple small compute clusters where the owners
didn't think that fault tolerance was worth paying for.

Of course, any acceptable design will have to enable systems that are fault
tolerant to export/expose that capability.  To my mind it's more a matter of
ensuring that non-fault-tolerant systems aren't excluded from participation
in a grid.



Other things we agree on:

.        We should certainly examine what remote job submission systems do.
We should certainly look at existing systems like Globus, Unicore, and
Legion.  In general, we should be looking at everything that has any actual
experience that we can learn from and everything that is actually deployed
and hence represents a system that we potentially need to interoperate with.
(Whether a final design is actually able to interoperate at any but the most
basic level with various exotic existing systems is a separate issue.)

.        We should absolutely focus on codifying what we know how to do and
avoid doing research as part of a standards process.  I believe that
thinking carefully about how to support evolution and extension is our best
hope for allowing people to defer trying to bake their pet research topics
into standards, since it provides a story for why today's standards don't
preclude tomorrow's improvements.



So I would propose that next steps are:

.        Continue to explore and classify various HPC use cases of various
differing levels of complexity.

.        Describe the requirements and limitations of existing job
scheduling and remote job submission systems.

.        Continue identifying and discussing key "features" of use cases and
potential design solutions, such as the four that you identified in your
last email.



Marvin.



________________________________

From: Carl Kesselman [mailto:carl at isi.edu]
Sent: Tuesday, March 14, 2006 7:50 AM
To: Marty Humphrey; ogsa-wg at ggf.org
Cc: Marvin Theimer
Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical design
efforts"



Hi,



Just to be clear, I'm not trying to suggest that the scope be expanded. I
agree that focusing on a baby step is a good approach, and I am in total
agreement with many of the assumptions stated in Marvin's list.
However, in taking baby steps I think that it is important that we end up
walking, and in defining the use case, one can easily create solutions
that will not get you to the next step. This is my point about looking at
what we know how to do and have been doing in production settings for many
years now. In my mind, one of the scope-grandness problems has been that
there has been far too little focus on codifying what we know how to do, in
favor of using a standards process as an excuse to design new things.  So at
the risk of sounding partisan, the simplified use case that Marvin is
proposing is exactly the use case that GRAM has been handling for over ten
years now (I think the same can be said about UNICORE and Legion).



So let me try to be constructive.  One of the things that falls out of
Marvin's list could be a set of basic concepts/operations that need to be
defined.  These include:

1) A way of describing "local" job configuration, i.e. where to find the
executable, data files, etc. This should be very conservative in its
assumptions about shared file systems and accessibility. In general, what
needs to be stated here is which aspects of the underlying resource are
exposed to the outward-facing interface.

2) A way of naming a submission point (should probably have a way of
modeling queues).

3) A core set of job management operations: submit, status, kill. These need
to be defined in such a way as to be tolerant of a variety of failure
scenarios, in that the state needs to be well defined in the case of
failure.

4) A state model that one can use to describe what is going on with the jobs,
and a way to access that state.  It can be simple (queued, running, done), but
may need to be extensible.  One can view the accounting information as being
exposed through this state as well.



So, one thing to do would be to agree that these are (or are not) the right
four things that need to be defined and, if so, start to flesh these out in a
way that supports the core use case but doesn't introduce assumptions that
would preclude more complex use cases in the future.
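As a strawman sketch only (the class, operation names, and states below are my own illustration, not anything the group has agreed on), items 3 and 4 might be fleshed out along these lines: three operations, a minimal three-state model, and job state that remains queryable after termination so that the state is well defined for a client recovering from a failure.

```python
from enum import Enum


class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    DONE = "done"  # covers normal completion, failure, and cancellation


class MinimalScheduler:
    """Strawman for the core operations: submit, status, kill."""

    def __init__(self):
        self._jobs = {}  # job_id -> JobState
        self._next_id = 0

    def submit(self, job_description):
        job_id = self._next_id
        self._next_id += 1
        self._jobs[job_id] = JobState.QUEUED
        return job_id

    def status(self, job_id):
        # Status stays well defined even after the job terminates, so a
        # client that crashed and restarted can still learn the outcome.
        return self._jobs[job_id]

    def kill(self, job_id):
        if self._jobs[job_id] is not JobState.DONE:
            self._jobs[job_id] = JobState.DONE
```

An extensible version could grow the enum (suspended, migrating, etc.) without changing the three core operations, which is where the restricted-extension discussion would come in.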





Carl



________________________________

From: owner-ogsa-wg at ggf.org [mailto:owner-ogsa-wg at ggf.org] On Behalf Of
Marty Humphrey
Sent: Tuesday, March 14, 2006 6:32 AM
To: ogsa-wg at ggf.org
Cc: 'Marvin Theimer'
Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical design
efforts"



Carl,



Your comments are very important. We would love to have your active
participation in this effort. Your experience is, of course, matched by few!



I re-emphasize that this represents (my words, not anyone else's) "baby
steps" that are necessary and important for the Grid community.  In my
opinion, the biggest challenge will be to fight the urge to expand the scope
beyond a small size. You cannot ignore the possibility that the GGF has NOT
made as much progress as it should have to date, and one plausible
explanation is that the scope has been too grand.



-- Marty





________________________________

From: owner-ogsa-wg at ggf.org [mailto:owner-ogsa-wg at ggf.org] On Behalf Of Carl
Kesselman
Sent: Tuesday, March 14, 2006 8:47 AM
To: Marvin Theimer; Ian Foster; ogsa-wg at ggf.org
Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical design
efforts"



Hi,



While I have no wish to engage in the "what is a Grid" argument, there are some
elements of your base use case that I would be concerned about.
Specifically, the assumption that the submission is into a "local cluster" on
which there is an existing account may lead one to a solution that does not
generalize to the case of submission across autonomous policy domains.  I
would also argue that ignoring issues of fault tolerance from the beginning
is problematic.  One must at least design operations that are restartable
(for example, at-most-once submission semantics).



I would finally suggest that while examining existing job scheduling systems
is a good thing to do, we should also examine existing remote submission
systems (dare I say Grid systems).  The basic HPC use case is one in which
there is a significant amount of implementation and usage experience.



Thanks,


Carl





________________________________

From: owner-ogsa-wg at ggf.org [mailto:owner-ogsa-wg at ggf.org] On Behalf Of
Marvin Theimer
Sent: Monday, March 13, 2006 2:42 PM
To: Ian Foster; ogsa-wg at ggf.org
Cc: Marvin Theimer
Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical design
efforts"



Hi;



Ian, you are correct that I view job submission to a cluster as being one of
the simplest, and hence most basic, HPC use cases to start with.  Or, to be
slightly more general, I view job submission to a "black box" that can run
jobs (be it a cluster, an SMP, an SGI NUMA machine, or what-have-you) as
being the simplest and hence most basic HPC use case to start with.  The key
distinction for me is that the internals of the "box" are for the most part
not visible to the client, at least as far as submitting and running compute
jobs is concerned.  There may well be a separate interface for dealing with
things like system management, but I want to explicitly separate those
things out in order to allow for use of "boxes" that might be managed by
proprietary means, or by means obeying standards that a particular job
submission client is unfamiliar with.



I think the use case that Ravi Subramaniam posted to this mailing list back
on 2/17 is a good one to start a discussion around.  However, I'd like to
present it from a different point-of-view than he did.  The manner in which
the use case is currently presented emphasizes all the capabilities and
services needed to handle the fully general case of submitting a batch job
to a computing utility/service.  That's a great way of producing a taxonomy
against which any given system or design can be compared to see what it has
to offer.  I would argue that the next step is to ask what's the simplest
subset that represents a useful system/design, and how one should categorize
the various capabilities and services he has identified so as to arrive at
meaningful components that can be selectively used to obtain progressively
more capable systems.



Another useful exercise to do is to examine existing job scheduling systems
in order to understand what they provide.  Since in the real world we will
have to deal with the legacy of existing systems it will be important to
understand how they relate to the use cases we explore.  In the same vein,
it will be important to take into account and understand other existing
infrastructures that people use that are related to HPC use cases.  I'm
thinking of things like security infrastructures, directory services, and so
forth.  From the point-of-view of managing complexity and reducing
total-cost-of-ownership, it will be important to understand the extent to
which existing infrastructure and services can be reused rather than
reinvented.



To kick off a discussion around the topic of a minimalist HPC use case, I
present a straw man description of such below and then present a first
attempt at categorizing various areas of extension.  The categorization of
extension areas is not meant to be complete or even all that carefully
thought-out as far as componentization boundaries are concerned; it is
merely meant to be a first contribution to get the discussion going.



A basic HPC use case: Compute cluster embedded within an organization.

.     This is your basic batch job scheduling scenario.  Only a very basic
state transition diagram is visible to the client, with the following states
for a job: queued, running, finished.  Additional states -- and associated
state transition request operations and functionality -- are not supported.
Examples of additional states and associated functionality include
suspension of jobs and migration of jobs.

.     Only "standard" resources can be described, for example: number of
cpus/nodes needed, memory requirements, disk requirements, etc.  (think
resources that are describable by JSDL).

.     Once a job has been submitted it can be cancelled, but its resource
requests can't be modified.

.     A distributed file system is accessible from client desktop machines
and client file servers, as well as compute nodes of the compute cluster.
This implies that no data staging is required, that programs can be (for the
most part) executed from existing file system locations, and that no program
"provisioning" is required (since you can execute them from wherever they
are already installed).  Thus in this use case all data transfer and program
installation operations are the responsibility of the user.

.     Users already have accounts within the existing security
infrastructure (e.g. Kerberos).  They would like to use these and not have
to create/manage additional authentication/authorization credentials (at
least at the level that is visible to them).

.     The job scheduling service resides at a well-known network name and it
is aware of the compute cluster and its resources by "private" means (e.g.
it runs on the head node of the cluster and employs private means to monitor
and control the resources of the cluster).  This implies that there is no
need for any sort of directory services for finding the compute cluster or
the resources it represents other than basic DNS.

.     Compute cluster system management is opaque to users and is the
concern of the compute cluster's owners.  This implies that system
management is not part of the compute cluster's public job scheduling
interface.  This also implies that there is no need for a logging interface
to the service.  I assume that application-level logging can be done by
means of libraries that write to client files; i.e. that there is no need
for any sort of special system support for logging.

.     A simple polling-based interface is the simplest form of interface to
something like a job scheduling service.  However, a simple call-back
notification interface is a very useful addition that potentially provides
substantial performance benefits, since it avoids a great deal of
unnecessary network traffic.  Only job state changes result in
notification messages.

.     There are no notions of fault tolerance.  Jobs that fail must be
resubmitted by the client.  Neither the cluster head node nor its compute
nodes are fault tolerant.  I do expect the client software to return an
indication of failure-due-system-fault when appropriate.  (Note that this
may also occur when things like network partitions occur.)

.     One does need some notion of how to deal with orphaned resources and
jobs.  The notion of job lifetime and post-expiration garbage collection is
a natural approach here.

.     The scheduling service provides a fixed set of scheduling policies,
with only a few basic choices (or maybe even just one), such as FIFO or
round-robin.  There is no notion, in general, of SLAs (which are a form of
scheduling policy).

.     Enough information must be returned to the client when a job finishes
to enable basic accounting functionality.  This means things like total
wall-clock time the job ran and a summary of resources used.  There is not a
need for the interface to support any sort of grouping of accounting
information.  That is, jobs do not need to be associated with projects,
groups, or other accounting entities and the job scheduling service is not
responsible for tracking accounting information across such entities.  As
long as basic resource utilization information is returnable for each job,
accounting can be done externally to the job scheduling service.  I do
assume that jobs can be uniquely identified by some means and can be
uniquely associated with some principal entity existing in the overall
system, such as a user name.

.     Just as there is no notion of requiring the job scheduling service to
track any but the most basic job-level accounting information, there is no
notion of the service enforcing quotas on jobs.

.     Although it is generally useful to separate the notions of resource
reservation from resource usage (e.g. to enable interactive and debugging
use of resources), it is not a necessity for the most basic of job
scheduling services. 

.     There is no notion of tying multiple jobs together, either to support
things like dependency graphs or to support things like workflows.  Such
capabilities must be implemented by clients of the job scheduling service.
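The call-back notification point in the list above can be illustrated with a small sketch. The class and method names here are purely hypothetical, not part of any proposed interface: the service pushes a message only when a job's state actually changes, so an idle job generates no traffic at all, unlike a polling client.

```python
class NotifyingJob:
    """Toy job record that notifies subscribers only on state *changes*."""

    def __init__(self, job_id):
        self.job_id = job_id
        self.state = "queued"
        self._callbacks = []

    def subscribe(self, callback):
        self._callbacks.append(callback)

    def set_state(self, new_state):
        if new_state == self.state:
            return  # no change, no message: this is the traffic saving
        self.state = new_state
        for cb in self._callbacks:
            cb(self.job_id, new_state)


# Usage: a client registers once and then receives only real transitions.
received = []
job = NotifyingJob(42)
job.subscribe(lambda jid, st: received.append((jid, st)))
job.set_state("queued")    # no-op: state unchanged, nothing delivered
job.set_state("running")
job.set_state("finished")
```

A polling client would have had to issue status calls continuously to observe the same two transitions; here exactly two messages cross the wire.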



Interesting extension areas:

.      Additional scheduling policies

o     Weighted fair-share, ...

o     Multiple queues

o     SLAs

o     ...

.      Extended resource descriptions

o     Additional resource types, such as GPUs

o     Additional types of compute resources, such as desktop computers

o     Condor-style class ads

.      Extended job descriptions (as returned to requesting clients and sys
admins)

.      Additional classes of security credentials

.      Reservations separated from execution

o     Enabling interactive and debugging jobs

o     Support for multiple competing schedulers (incl. desktop cycle
stealing and market-based approaches to scheduling compute resources)

.      Ability to modify jobs during their existence

.      Fault tolerance

o     Automatic rescheduling of jobs that failed due to system faults

o     Highly available resources:  This is partly a policy statement by a
scheduling service about its characteristics and partly the ability to
rebind clients to migrated service endpoints

.      Extended state transition diagrams and associated functionalities

o     Job suspension

o     Job migration

o     ...

.      Accounting & quotas

.      Operating on arrays of jobs

.      Meta-schedulers, multiple schedulers, and ecologies and hierarchies
of multiple schedulers

o     Meta-schedulers

.      Hierarchical job scheduling with a meta-scheduler as the only entry
point; forwarding jobs to the meta-scheduler from other subsidiary
schedulers

o     Condor-style matchmaking

.      Directory services

o     Using existing directory services

o     Abstract directory service interface(s)

.      Data transfer topics

o     Application data staging

.      Naming

.      Efficiency

.      Convenience

.      Cleanup

o     Program staging/provisioning

.      Description

.      Installation

.      Cleanup





Marvin.



________________________________

From: Ian Foster [mailto:foster at mcs.anl.gov]
Sent: Monday, February 20, 2006 9:20 AM
To: Marvin Theimer; ogsa-wg at ggf.org
Cc: Marvin Theimer; Savas Parastatidis; Tony Hey; Marty Humphrey;
gcf at grids.ucs.indiana.edu
Subject: Re: [ogsa-wg] Paper proposing "evolutionary vertical design
efforts"



Dear All:

The most important thing to understand at this point (IMHO) is the scope of
this "HPC use case," as this will determine just how minimal we can be.

I get the impression that the principal goal may be "job submission to a
cluster." Is that correct? How do we start to circumscribe the scope more
explicitly?

Ian.



At 05:45 AM 2/16/2006 -0800, Marvin Theimer wrote:

Enclosed is a paper that advocates an additional set of activities that the
authors believe the OGSA working groups should engage in.



Broadly speaking, the OGSA and related working groups are already doing a
bunch of important things:

.         There is broad exploration of the big picture, including
enumeration of use cases, taxonomy of areas, identification of research
issues, etc.

.         There is work going on in each of the horizontal areas that have
been identified, such as EMS, data services, etc.

.         There is work going on around individual specifications, such as
BES, JSDL, etc.



Given that individual specifications are beginning to come to fruition, the
authors believe it is time to also start defining vertical profiles that
precisely describe how groups of individual specifications should be
employed to implement specific use cases in an interoperable manner.  The
authors also believe that the process of defining these profiles offers an
opportunity to close the design loop by relating the various on-going
protocol and standards efforts back to the use cases in a very concrete
manner.  This provides an end-to-end setting in which to identify holes and
issues that might require additional protocols and/or (incremental) changes
to existing protocols.  The paper introduces both the general notion of
doing focused vertical design efforts and then focuses on a specific vertical
design effort, namely a minimal HPC design. 



The paper derives a specific HPC design in a first principles manner, since
the authors believe that this increases the chances of identifying issues.
As a consequence, existing specifications and the activities of existing
working groups are not mentioned, and this paper is not an attempt to
actually define a specifications profile.  Also, the absence of references
to existing work is not meant to imply that such work is in any way
irrelevant or inappropriate.  The paper should be viewed as a first abstract
attempt to propose a new kind of activity within OGSA.  The expectation is
that future open discussions and publications will explore the concrete
details of such a proposal.



This paper was recently sent to a few key individuals in order to get
feedback from them before submitting it to the wider GGF community.
Unfortunately that process took longer than intended and some members of the
community may have already seen a copy of the paper without knowing the
context within which it was written.  This email should hopefully dispel any
misconceptions that may have occurred.



For those people who will be around for the F2F meetings on Friday,
Marvin Theimer will be giving a talk on the contents of this paper at a time
and place to be announced.



Marvin Theimer, Savas Parastatidis, Tony Hey, Marty Humphrey, Geoffrey Fox



_______________________________________________________________
Ian Foster                    www.mcs.anl.gov/~foster
Math & Computer Science Div.  Dept of Computer Science
Argonne National Laboratory   The University of Chicago   
Argonne, IL 60439, U.S.A.     Chicago, IL 60637, U.S.A.
Tel: 630 252 4619             Fax: 630 252 1997
        Globus Alliance, www.globus.org

