[ogsa-wg] HPC profile: A revised base HPC use case plus a list of common use cases
Hiro Kishimoto
hiro.kishimoto at jp.fujitsu.com
Sun Apr 2 04:09:12 CDT 2006
Thanks Marvin,
I've uploaded it to the F2F meeting folder as well as the HPC profile
design team folder (by hardlink).
https://forge.gridforum.org/projects/ogsa-wg/document/HPC_Base_and_Common_Use_Cases
See you very soon.
----
Hiro Kishimoto
Marvin Theimer wrote:
> Hi;
>
>
>
> Enclosed is a document in which I present a revised HPC base use case as
> well as the common use cases that I would like to propose.
>
>
>
> Marvin.
>
>
>
> ________________________________
>
> From: Marvin Theimer
> Sent: Tuesday, March 21, 2006 10:29 AM
> To: Carl Kesselman
> Cc: humphrey at cs.virginia.edu; ogsa-wg at ggf.org; Marvin Theimer
> Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical design
> efforts"
>
>
>
> Hi;
>
>
>
> Whereas I agree with you that at-most-once semantics are very desirable,
> I would like to point out that not all existing job schedulers implement
> them. I know that both LSF and CCS (the Microsoft HPC job scheduler)
> don't. I've been trying to find out whether PBS and SGE do or don't.
>
>
>
> So, this brings up the following slightly more general question: should
> the simplest base case be the simplest case that does something useful,
> or should it be more complicated than that? I can see good arguments on
> both sides:
>
> * Whittling things down to the simplest possible base case
> maximizes the likelihood that parties can participate. Every feature
> added represents one more feature that some existing system may not be
> able to support or that a new system has to provide even when it's not
> needed in the context of that system. Suppose, for example, that PBS
> and SGE don't provide transactional semantics of the type you described.
> Then 4 of the 6 most common job scheduling systems would not have this
> feature and would need to somehow add it to their implementations. In
> this particular case it might be too difficult to add in practice, but
> in general there might be problems.
>
> * On the other hand, since there are many clients and arguably
> far fewer server implementations, features that substantially simplify
> client behavior/programming and that are not too onerous to implement in
> existing and future systems should be part of the base case. The
> problem, of course, is that this is a slippery slope at the end of which
> lies the number 42 (ignore that last phrase if you're not a fan of The
> Hitchhiker's Guide to the Galaxy).
>
>
>
> Personally, the slippery slope argument makes me lean towards defining
> the simplest possible base use case, since otherwise we'll spend a
> (potentially very) long time arguing about which features are important
> enough to justify being in the base case. One possible way forward on
> this issue is to have people come up with lists of features that they
> feel belong in the base use case and then we agree to include only those
> that have a large majority of the community arguing for their inclusion
> in the base case.
>
>
>
> Unfortunately defining what "large majority" should be is also not easy
> or obvious. Indeed, one can argue that we can't even afford to let all
> votes be equal. Consider the following hypothetical (and contrived)
> case: 100 members of a particular academic research community show up
> and vote that the base case must include support for a particular
> complicated scheduling policy and the less-than-ten suppliers of
> existing job schedulers with significant numbers of users all vote
> against it. Should it be included in the base case? What happens if
> the major scheduler vendors/suppliers decide that they can't justify
> implementing it and therefore can't be GGF spec-compliant and therefore
> go off and define their own job scheduling standard? The hidden issue
> is, of course, whether those voting are representative of the overall
> HPC user population. I can't personally answer that question, but it
> does again lead me to want to minimize the number of times I have to ask
> that question - i.e. the number of features that I have to consider for
> inclusion in the base case.
>
>
>
> So this brings me to the question of next steps. Recall that the
> approach I'm advocating - and that others have bought in to as far as I
> can tell - is that we define a base case and the mechanisms and approach
> to how extensions of the base case are done. I assert that the
> absolutely most important part of defining how extension should work is
> ensuring that multiple extensions don't end up producing a hairball
> that's impossible to understand, implement, or use. In practice this
> means coming up with a restricted form of extension since history is
> pretty clear on the pitfalls of trying to support arbitrarily general
> extension schemes.
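One way to picture a "restricted form of extension" is sketched below in Python. It assumes, purely for illustration, that an extension is a named unit that may add operations but may never redefine a base operation, and that it must declare any other extensions it depends on. The names Extension, Service, and the operation strings are hypothetical and are not taken from any GGF specification.

```python
from dataclasses import dataclass, field

# Operations every implementation of the (hypothetical) base case must offer.
BASE_OPERATIONS = frozenset({"submit", "status", "cancel"})


@dataclass(frozen=True)
class Extension:
    """A named, self-contained addition to the base case (hypothetical)."""
    name: str
    adds_operations: frozenset = frozenset()
    requires: frozenset = frozenset()  # names of other extensions


@dataclass
class Service:
    """A job service that advertises the extensions it implements."""
    extensions: dict = field(default_factory=dict)

    def register(self, ext: Extension) -> None:
        # Restriction 1: an extension may add operations but may not
        # redefine a base operation.
        clash = ext.adds_operations & BASE_OPERATIONS
        if clash:
            raise ValueError(f"{ext.name} redefines base operations: {sorted(clash)}")
        # Restriction 2: dependencies must already be registered, which
        # keeps the dependency graph acyclic.
        missing = ext.requires - set(self.extensions)
        if missing:
            raise ValueError(f"{ext.name} needs unregistered extensions: {sorted(missing)}")
        self.extensions[ext.name] = ext

    def operations(self) -> frozenset:
        """Base operations plus everything the registered extensions add."""
        ops = set(BASE_OPERATIONS)
        for ext in self.extensions.values():
            ops |= ext.adds_operations
        return frozenset(ops)

    def supports(self, required: set) -> bool:
        """Lets a client check up front whether its required extensions exist."""
        return required <= set(self.extensions)
```

A client that only knows the base case can ignore the extension table entirely, while a richer client can call supports() before relying on, say, job suspension.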
>
>
>
> This is one of the places where identification of common use cases comes
> in. If we define the use cases that we think might actually occur then
> we can ask whether a given approach to extension has a plausible way of
> achieving all the identified use cases. Of course, future desired use
> cases might not be achievable by the extension schemes we come up with
> now, but that possibility is inevitable given anything less than a fully
> general extension scheme. Indeed, even among the common use cases we
> identify now, we might discover that there are trade-offs where a
> simpler (and hence probably more understandable and easier to implement
> and use) extension scheme can cover 80% of the use cases while a much
> more complicated scheme is required to cover 100% of the use cases.
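The 80%-versus-100% trade-off can be made concrete by weighting each use case by how common it is expected to be and summing the weights that a given extension scheme covers. The Python sketch below does only that arithmetic. Every use case name, weight, and scheme in it is invented for illustration and is not a proposal.

```python
def coverage(covered, weights):
    """Fraction of the total use-case weight that an extension scheme covers."""
    total = sum(weights.values())
    return sum(w for name, w in weights.items() if name in covered) / total


# Hypothetical common use cases, weighted by expected frequency.
WEIGHTS = {
    "batch job on local cluster": 40,
    "job with data staging": 25,
    "array of jobs": 15,
    "job suspension": 12,
    "job migration": 8,
}

# Hypothetical extension schemes and the use cases each could express.
SCHEMES = {
    "simple": {
        "batch job on local cluster",
        "job with data staging",
        "array of jobs",
    },
    "general": set(WEIGHTS),
}


def report():
    """Coverage of every scheme, keyed by scheme name."""
    return {name: coverage(covered, WEIGHTS) for name, covered in SCHEMES.items()}


if __name__ == "__main__":
    for name, share in report().items():
        print(f"{name}: {share:.0%}")
```

With ranked use cases from each participant, the same calculation would show how much coverage a simpler scheme gives up.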
>
>
>
> Given all this, here are the concrete next steps I'd like to propose:
>
> * Everyone who is participating in this design effort should
> define what they feel should be the HPC base use case. This represents
> the simplest use case - and associated features like transactional
> submit semantics - that you feel everyone in the HPC grid world must
> implement. We will take these use case candidates and debate which one
> to actually settle on.
>
> * Everyone should define the set of HPC use cases that they
> believe might actually occur in practice. I will refer to these as the
> common use cases, in contrast to the base use case. The goal here is
> not to define the most general HPC use case, but rather the more
> restricted use cases that might occur in real life. For example, not
> all systems will support job migration, so whereas a fully general HPC
> use case would include the notion of job migration, I argue that one or
> more common use cases will not include job migration.
>
> Everyone should also prioritize and rank their common use cases so that
> we can discuss 80/20-style trade-offs concerning which use cases to
> support with any given approach to extension. Thus prioritization
> should include the notion of how common you think a use case will
> actually be, and hence how important it will be to actually support that
> use case.
>
> * Everyone should start thinking about what kinds of extension
> approaches they believe we should define, given the base use case and
> common use cases that they have identified.
>
>
>
> As multiple people have pointed out, an exploration of common HPC use
> cases has already been done one or several times before, including in
> the EMS working group. I'm still catching up on reading GGF documents,
> so I don't know how much those prior efforts explored the issue from the
> point-of-view of base case plus extensions. If these prior explorations
> did address the topic of base-plus-extensions and you agree with the
> specifics that were arrived at then this exercise will be a
> quick-and-easy one for you: you can simply publish the appropriate links
> to prior material in an email to this mailing list. I will personally
> be sending in my list independent of prior efforts in order to provide a
> "newcomer's" perspective on the subject. It will be interesting to see how
> much overlap there is.
>
>
>
> One very important point that I'd like to raise is the following: Time
> is short and "best" is the enemy of "good enough". Microsoft is
> planning to provide a Web services-based interoperability interface to
> its job scheduler sometime in the next year or two. I know that many of
> the other job scheduler vendors/suppliers are also interested in having
> an interoperability story in place sooner rather than later. To meet
> this schedule on the Microsoft side will require locking down a first
> fairly complete draft of whatever design will be shipped by essentially
> the end of August. That's so that we can do all the necessary
> debugging, interoperability testing, security threat modeling, etc. that
> goes with shipping an actual finished product. What that means for the
> HPC profile work is that, come the end of August, Microsoft - and
> possibly other scheduler vendors/suppliers - will need to lock down and
> start coding some version of Web Services-based job scheduling and data
> transfer protocols. If there is a fairly well-defined, feasible set of
> specs/profile coming out of the GGF HPC working group (for
> recommendation - NOT yet for actual standards approval) that has some
> reasonable level of consensus by then, then that's what Microsoft will
> very likely go with. Otherwise Microsoft will need to defer the idea of
> shipping anything that might be GGF compliant to version 3 of our
> product, which will probably ship about 4 years from now.
>
>
>
> The chances of coming up with the "best" HPC profile by the end of
> August are slim. The chances of coming up with a fairly simple design
> that is "good enough" to cover the most important common cases by means
> of a relatively simple, restricted form of extension seems much more
> feasible. Covering a richer set of use cases would need to be deferred
> to a future version of the profile, much in the manner that BES has been
> defined to cover an important sub-category of use cases now, with a
> fuller EMS design being done in parallel as future work. So I would
> argue that perhaps the most important thing this design effort and the
> planned HPC profile working group that will be set up in Tokyo can do is
> to identify what a "good enough" version 1 HPC profile should be.
>
>
>
> Marvin.
>
>
>
>
>
> ________________________________
>
> From: Carl Kesselman [mailto:carl at isi.edu]
> Sent: Thursday, March 16, 2006 12:49 AM
> To: Marvin Theimer
> Cc: humphrey at cs.virginia.edu; ogsa-wg at ggf.org
> Subject: Re: [ogsa-wg] Paper proposing "evolutionary vertical design
> efforts"
>
>
>
> Hi,
>
> In the interest of furthering agreement, I was not arguing that the
> application had to be restartable. Rather, what has been shown to be
> important is that the protocol be restartable in the following sense:
> if you submit a job and the far-end server fails, is the job running or
> not? If you resubmit, do you get another job instance? The GT submission
> protocol and Condor have transactional semantics so that you can have
> at-most-once submit semantics regardless of client and server failures.
> The fact that your application may be non-idempotent is exactly why having
> well-defined semantics in this case is important.
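A minimal Python sketch of at-most-once submission follows. It assumes one common technique, a client-chosen submission identifier that the server remembers, and is not a description of how the GT submission protocol or Condor actually implement their semantics. JobService, LossyChannel, and submit_at_most_once are hypothetical names, and the in-memory table stands in for storage that a real server would have to make durable in order to survive its own failures.

```python
import uuid


class JobService:
    """Server side: remembers which submission ids it has already accepted."""

    def __init__(self):
        # In a real system this table would need to be durable.
        self._job_by_submission = {}

    def submit(self, submission_id, command):
        # A replayed submission id returns the original job instead of
        # creating a second instance.
        if submission_id not in self._job_by_submission:
            self._job_by_submission[submission_id] = len(self._job_by_submission) + 1
        return self._job_by_submission[submission_id]

    def job_count(self):
        return len(self._job_by_submission)


class LossyChannel:
    """Delivers every request to the service but drops the first `drop` replies."""

    def __init__(self, service, drop):
        self.service = service
        self.drop = drop

    def submit(self, submission_id, command):
        job_id = self.service.submit(submission_id, command)
        if self.drop > 0:
            self.drop -= 1
            raise ConnectionError("reply lost")
        return job_id


def submit_at_most_once(channel, command, attempts=3):
    """Client side: retries reuse one submission id, so retries are safe."""
    submission_id = str(uuid.uuid4())
    for _ in range(attempts):
        try:
            return channel.submit(submission_id, command)
        except ConnectionError:
            continue  # outcome unknown: ask again with the same id
    raise RuntimeError("submission outcome unknown after retries")
```

Because the client cannot tell a lost request from a lost reply, reusing the identifier is what lets it resubmit a non-idempotent application without risking a second job instance.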
>
> So what is the next step?
>
> Carl
>
> Dr. Carl Kesselman email: carl at isi.edu
> USC/Information Sciences Institute WWW: http://www.isi.edu/~carl
> 4676 Admiralty Way, Suite 1001 Phone: (310) 448-9338
> Marina del Rey, CA 90292-6695 Fax: (310) 823-6714
>
>
>
> -----Original Message-----
> From: Marvin Theimer <theimer at microsoft.com>
> To: Carl Kesselman <carl at isi.edu>
> CC: Marvin Theimer <theimer at microsoft.com>; Marty Humphrey
> <humphrey at cs.virginia.edu>; ogsa-wg at ggf.org <ogsa-wg at ggf.org>
> Sent: Wed Mar 15 14:26:36 2006
> Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical design
> efforts"
>
> Hi;
>
>
>
> I suspect that we're mostly in agreement on things. In particular, I
> think your list of four core aspects is a great starting point for a
> discussion on the topic.
>
>
>
> I just replied to an earlier email from Ravi with a description of what
> I'm hoping to get out of examining various HPC use cases:
>
> * Identification of the simplest base case that everyone will
> have to implement.
>
> * Identification of common cases we want to optimize.
>
> * Identification of how evolution and selective extension will
> work.
>
>
>
> I totally agree with you that the base use case I described isn't really
> a "grid" use case. But it is an HPC use case - in fact it is arguably
> the most common use case in current existence. :-) So I think it's
> important that we understand how to seamlessly integrate and support
> that common - and very simple - use case.
>
>
>
> I also totally agree with you that we can't let a solution to the
> simplest HPC use case paint us into a corner that prevents supporting
> the richer use cases that grid computing is all about. That's why I'd
> like to spend significant effort exploring and understanding the issues
> of how to support evolution and selective extension. In an ideal world
> a legacy compute cluster job scheduler could have a simple "grid shim"
> that let it participate at a basic level, in a natural manner, in a grid
> environment, while smarter clients and HPC services could interoperate
> with each other in various selectively richer manners by means of
> extensions to the basic HPC grid design.
>
>
>
> One place where I disagree with you is your assertion that everything
> needs to be designed to be restartable. While that's a good goal to
> pursue I'm not convinced that you can achieve it in all cases. In
> particular, there are at least two cases that I claim we want to support
> that aren't restartable:
>
> * We want to be able to run applications that aren't restartable;
> for example, because they perform non-idempotent operations on the
> external physical environment. If such an application fails during
> execution then the only one who can figure out what the proper next
> steps are is the end user.
>
> * We want to be able to include (often-times legacy) systems that
> aren't fault tolerant, such as simple small compute clusters where the
> owners didn't think that fault tolerance was worth paying for.
>
> Of course any acceptable design will have to enable systems that are
> fault tolerant to export/expose that capability. To my mind it's more a
> matter of ensuring that non-fault-tolerant systems aren't excluded from
> participation in a grid.
>
>
>
> Other things we agree on:
>
> * We should certainly examine what remote job submission systems
> do. We should certainly look at existing systems like Globus, Unicore,
> and Legion. In general, we should be looking at everything that has any
> actual experience that we can learn from and everything that is actually
> deployed and hence represents a system that we potentially need to
> interoperate with. (Whether a final design is actually able to
> interoperate at any but the most basic level with various exotic
> existing systems is a separate issue.)
>
> * We should absolutely focus on codifying what we know how to do
> and avoid doing research as part of a standards process. I believe that
> thinking carefully about how to support evolution and extension is our
> best hope for allowing people to defer trying to bake their pet research
> topic into standards since it provides a story for why today's standards
> don't preclude tomorrow's improvements.
>
>
>
> So I would propose that next steps are:
>
> * Continue to explore and classify various HPC use cases of
> various differing levels of complexity.
>
> * Describe the requirements - and limitations - of existing job
> scheduling and remote job submission systems.
>
> * Continue identifying and discussing key "features" of use cases
> and potential design solutions, such as the four that you identified in
> your last email.
>
>
>
> Marvin.
>
>
>
> ________________________________
>
> From: Carl Kesselman [mailto:carl at isi.edu]
> Sent: Tuesday, March 14, 2006 7:50 AM
> To: Marty Humphrey; ogsa-wg at ggf.org
> Cc: Marvin Theimer
> Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical design
> efforts"
>
>
>
> Hi,
>
>
>
> Just to be clear, I'm not trying to suggest that the scope be expanded.
> I agree that the approach of focusing on a baby step is a good one, and
> many of the assumptions stated in Marvin's list I am in total agreement
> with. However, in taking baby steps I think that it is important that we
> end up walking, and that in defining the use case, one can easily create
> solutions that will not get you to the next step. This is my point about
> looking at what we know how to do and have been doing in production
> settings for many years now. In my mind, one of the scope grandness
> problems has been that there has been far too little focus on codifying
> what we know how to do in favor of using a standards process as an
> excuse to design new things. So at the risk of sounding partisan, the
> simplified use case that Marvin is proposing is exactly the use case
> that GRAM has been doing for over ten years now (I think the same can be
> said about UNICORE and Legion).
>
>
>
> So let me try to be constructive. One of the things that falls out of
> Marvin's list could be a set of basic concepts/operations that need to
> be defined. These include:
>
> 1) A way of describing "local" job configuration, i.e. where to find the
> executable, data files, etc. This should be very conservative with its
> assumptions on shared file systems and accessibility. In general, what
> needs to be stated here are what are the underlying aspects of the
> underlying resource that are exposed to the outward facing interface.
>
> 2) A way of naming a submission point (should probably have a way of
> modeling queues).
>
> 3) A core set of job management operations: submit, status, kill. These
> need to be defined in such a way as to be tolerant of a variety of
> failure scenarios, in that the state needs to be well defined in the
> case of failure.
>
> 4) A state model that one can use to describe what is going on with the
> jobs and a way to access that state. Can be simple (queued, running,
> done), may need to be extensible. One can view the accounting
> information as being exposed
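The four items above map fairly directly onto a small set of types. The Python sketch below is one possible reading, not an agreed interface: JobDescription, SubmissionPoint, JobState, and InMemoryJobManager are hypothetical names, and the in-memory manager exists only to show the shape of submit, status, and kill.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class JobState(Enum):
    """Item 4: a deliberately simple state model."""
    QUEUED = "queued"
    RUNNING = "running"
    DONE = "done"


@dataclass(frozen=True)
class JobDescription:
    """Item 1: the "local" job configuration, with no shared-filesystem assumption."""
    executable: str
    arguments: Tuple[str, ...] = ()
    data_files: Tuple[str, ...] = ()


@dataclass(frozen=True)
class SubmissionPoint:
    """Item 2: a named place to submit to, optionally with a queue."""
    service_name: str
    queue: Optional[str] = None


class InMemoryJobManager:
    """Item 3: submit, status, kill (toy implementation for illustration)."""

    def __init__(self, point: SubmissionPoint) -> None:
        self.point = point
        self._states: Dict[str, JobState] = {}
        self._descriptions: Dict[str, JobDescription] = {}

    def submit(self, description: JobDescription) -> str:
        job_id = f"{self.point.service_name}/{len(self._states) + 1}"
        self._states[job_id] = JobState.QUEUED
        self._descriptions[job_id] = description
        return job_id

    def status(self, job_id: str) -> JobState:
        # Raises KeyError for an unknown job, so "no such job" is
        # distinguishable from any real state.
        return self._states[job_id]

    def kill(self, job_id: str) -> JobState:
        # Killing an already finished job is a no-op, so a client that is
        # unsure whether an earlier kill arrived can safely repeat it.
        if job_id not in self._states:
            raise KeyError(job_id)
        self._states[job_id] = JobState.DONE
        return JobState.DONE
```

Making kill repeatable is one small example of keeping the state well defined when a request or its reply is lost.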
>
>
>
> So, one thing to do would be to agree that these are (or are not) the
> right four things that need to be defined and if so, start to flesh out
> these in a way that supports the core use case but doesn't introduce
> assumptions that would preclude more complex use cases in the future.
>
>
>
>
>
> Carl
>
>
>
> ________________________________
>
> From: owner-ogsa-wg at ggf.org [mailto:owner-ogsa-wg at ggf.org] On Behalf Of
> Marty Humphrey
> Sent: Tuesday, March 14, 2006 6:32 AM
> To: ogsa-wg at ggf.org
> Cc: 'Marvin Theimer'
> Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical design
> efforts"
>
>
>
> Carl,
>
>
>
> Your comments are very important. We would love to have your active
> participation in this effort. Your experience is, of course, matched by
> few!
>
>
>
> I re-emphasize that this represents (my words, not anyone else's) "baby
> steps" that are necessary and important for the Grid community. In my
> opinion, the biggest challenge will be to fight the urge to expand the
> scope beyond a small size. You cannot ignore the possibility that the
> GGF has NOT made as much progress as it should have to date.
> Furthermore, one such plausible explanation is that the scope is too
> grand.
>
>
>
> -- Marty
>
>
>
>
>
> ________________________________
>
> From: owner-ogsa-wg at ggf.org [mailto:owner-ogsa-wg at ggf.org] On Behalf Of
> Carl Kesselman
> Sent: Tuesday, March 14, 2006 8:47 AM
> To: Marvin Theimer; Ian Foster; ogsa-wg at ggf.org
> Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical design
> efforts"
>
>
>
> Hi,
>
>
>
> While I have no wish to engage in the "what is a Grid" argument, there
> are some elements of your base use case that I would be concerned about.
> Specifically, the assumption that the submission is into a "local
> cluster" on which there is an existing account may lead one to a
> solution that does not generalize to the case of submission across
> autonomous policy domains. I would also argue that
> ignoring issues of fault tolerance from the beginning is also
> problematic. One must at least design operations that are restartable
> (for example at most once submission semantics).
>
>
>
> I would finally suggest that while examining existing job scheduling
> systems is a good thing to do, we should also examine existing remote
> submission systems (dare I say Grid systems). The basic HPC use case is
> one in which there is a significant amount of implementation and usage
> experience.
>
>
>
> Thanks,
>
>
> Carl
>
>
>
>
>
> ________________________________
>
> From: owner-ogsa-wg at ggf.org [mailto:owner-ogsa-wg at ggf.org] On Behalf Of
> Marvin Theimer
> Sent: Monday, March 13, 2006 2:42 PM
> To: Ian Foster; ogsa-wg at ggf.org
> Cc: Marvin Theimer
> Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical design
> efforts"
>
>
>
> Hi;
>
>
>
> Ian, you are correct that I view job submission to a cluster as being
> one of the simplest, and hence most basic, HPC use cases to start with.
> Or, to be slightly more general, I view job submission to a "black box"
> that can run jobs - be it a cluster or an SMP or an SGI NUMA machine or
> what-have-you - as being the simplest and hence most basic HPC use case
> to start with. The key distinction for me is that the internals of the
> "box" are for the most part not visible to the client, at least as far
> as submitting and running compute jobs is concerned. There may well be
> a separate interface for dealing with things like system management, but
> I want to explicitly separate those things out in order to allow for use
> of "boxes" that might be managed by proprietary means or by means
> obeying standards that a particular job submission client is unfamiliar
> with.
>
>
>
> I think the use case that Ravi Subramaniam posted to this mailing list
> back on 2/17 is a good one to start a discussion around. However, I'd
> like to present it from a different point-of-view than he did. The
> manner in which the use case is currently presented emphasizes all the
> capabilities and services needed to handle the fully general case of
> submitting a batch job to a computing utility/service. That's a great
> way of producing a taxonomy against which any given system or design can
> be compared to see what it has to offer. I would argue that the next
> step is to ask what's the simplest subset that represents a useful
> system/design and how should one categorize the various capabilities and
> services he has identified so as to arrive at meaningful components that
> can be selectively used to obtain progressively more capable systems.
>
>
>
> Another useful exercise to do is to examine existing job scheduling
> systems in order to understand what they provide. Since in the real
> world we will have to deal with the legacy of existing systems it will
> be important to understand how they relate to the use cases we explore.
> In the same vein, it will be important to take into account and
> understand other existing infrastructures that people use that are
> related to HPC use cases. I'm thinking of things like security
> infrastructures, directory services, and so forth. From the
> point-of-view of managing complexity and reducing
> total-cost-of-ownership, it will be important to understand the extent
> to which existing infrastructure and services can be reused rather than
> reinvented.
>
>
>
> To kick off a discussion around the topic of a minimalist HPC use case,
> I present a straw man description of such below and then present a first
> attempt at categorizing various areas of extension. The categorization
> of extension areas is not meant to be complete or even all that
> carefully thought-out as far as componentization boundaries are
> concerned; it is merely meant to be a first contribution to get the
> discussion going.
>
>
>
> A basic HPC use case: Compute cluster embedded within an organization.
>
> * This is your basic batch job scheduling scenario. Only a very
> basic state transition diagram is visible to the client, with the
> following states for a job: queued, running, finished. Additional
> states -- and associated state transition request operations and
> functionality -- are not supported. Examples of additional states and
> associated functionality include suspension of jobs and migration of
> jobs.
>
> * Only "standard" resources can be described, for example: number of
> cpus/nodes needed, memory requirements, disk requirements, etc. (think
> resources that are describable by JSDL).
>
> * Once a job has been submitted it can be cancelled, but its
> resource requests can't be modified.
>
> * A distributed file system is accessible from client desktop
> machines and client file servers, as well as compute nodes of the
> compute cluster. This implies that no data staging is required, that
> programs can be (for the most part) executed from existing file system
> locations, and that no program "provisioning" is required (since you can
> execute them from wherever they are already installed). Thus in this
> use case all data transfer and program installation operations are the
> responsibility of the user.
>
> * Users already have accounts within the existing security
> infrastructure (e.g. Kerberos). They would like to use these and not
> have to create/manage additional authentication/authorization
> credentials (at least at the level that is visible to them).
>
> * The job scheduling service resides at a well-known network name
> and it is aware of the compute cluster and its resources by "private"
> means (e.g. it runs on the head node of the cluster and employs private
> means to monitor and control the resources of the cluster). This
> implies that there is no need for any sort of directory services for
> finding the compute cluster or the resources it represents other than
> basic DNS.
>
> * Compute cluster system management is opaque to users and is the
> concern of the compute cluster's owners. This implies that system
> management is not part of the compute cluster's public job scheduling
> interface. This also implies that there is no need for a logging
> interface to the service. I assume that application-level logging can
> be done by means of libraries that write to client files; i.e. that
> there is no need for any sort of special system support for logging.
>
> * A simple polling-based interface is the simplest form of interface
> to something like a job scheduling service. However, a simple call-back
> notification interface is a very useful addition that potentially
> provides substantial performance benefits since it can enable the
> avoidance of lots of unnecessary network traffic. Only job state
> changes result in notification messages.
>
> * There are no notions of fault tolerance. Jobs that fail must be
> resubmitted by the client. Neither the cluster head node nor its
> compute nodes are fault tolerant. I do expect the client software to
> return an indication of failure-due-to-system-fault when appropriate.
> (Note that this may also occur when things like network partitions
> occur.)
>
> * One does need some notion of how to deal with orphaned resources
> and jobs. The notion of job lifetime and post-expiration garbage
> collection is a natural approach here.
>
> * The scheduling service provides a fixed set of scheduling
> policies, with only a few basic choices (or maybe even just one), such
> as FIFO or round-robin. There is no notion, in general, of SLAs (which
> are a form of scheduling policy).
>
> * Enough information must be returned to the client when a job
> finishes to enable basic accounting functionality. This means things
> like total wall-clock time the job ran and a summary of resources used.
> There is not a need for the interface to support any sort of grouping of
> accounting information. That is, jobs do not need to be associated with
> projects, groups, or other accounting entities and the job scheduling
> service is not responsible for tracking accounting information across
> such entities. As long as basic resource utilization information is
> returnable for each job, accounting can be done externally to the job
> scheduling service. I do assume that jobs can be uniquely identified by
> some means and can be uniquely associated with some principal entity
> existing in the overall system, such as a user name.
>
> * Just as there is no notion of requiring the job scheduling service
> to track any but the most basic job-level accounting information, there
> is no notion of the service enforcing quotas on jobs.
>
> * Although it is generally useful to separate the notions of
> resource reservation from resource usage (e.g. to enable interactive and
> debugging use of resources), it is not a necessity for the most basic of
> job scheduling services.
>
> * There is no notion of tying multiple jobs together, either to
> support things like dependency graphs or to support things like
> workflows. Such capabilities must be implemented by clients of the job
> scheduling service.
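Several of the points above (the three-state job model, cancellation, polling plus state-change notification, a fixed FIFO policy, and per-job accounting) fit in a short Python sketch. It is an illustration under stated assumptions, not a proposed interface: BasicScheduler and its method names are hypothetical, one job runs at a time, time is passed in explicitly rather than read from a clock, and a cancelled job is simply reported as finished because the base case defines no other state.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    FINISHED = "finished"


@dataclass
class Job:
    job_id: int
    owner: str          # the principal the job is associated with
    cpus: int           # a "standard" resource request, fixed after submission
    state: State = State.QUEUED
    started_at: Optional[float] = None
    wall_clock: float = 0.0


class BasicScheduler:
    def __init__(self):
        self._queue = deque()   # FIFO is the only policy offered
        self._jobs = {}
        self._listeners = []
        self._running = None

    def subscribe(self, callback):
        """Register a callback invoked as callback(job_id, new_state)."""
        self._listeners.append(callback)

    def _set_state(self, job, state):
        job.state = state
        for callback in self._listeners:   # only state changes are notified
            callback(job.job_id, state)

    def submit(self, owner, cpus):
        job = Job(len(self._jobs) + 1, owner, cpus)
        self._jobs[job.job_id] = job
        self._queue.append(job)
        return job.job_id

    def status(self, job_id):
        """Polling interface."""
        return self._jobs[job_id].state

    def start_next(self, now):
        if self._running is None and self._queue:
            job = self._queue.popleft()
            job.started_at = now
            self._running = job
            self._set_state(job, State.RUNNING)

    def _finish(self, job, now):
        if job.started_at is not None:
            job.wall_clock = now - job.started_at
        if self._running is job:
            self._running = None
        self._set_state(job, State.FINISHED)
        # Basic accounting record; grouping by project is left to the client.
        return {"job_id": job.job_id, "owner": job.owner,
                "cpus": job.cpus, "wall_clock_seconds": job.wall_clock}

    def finish_running(self, now):
        return None if self._running is None else self._finish(self._running, now)

    def cancel(self, job_id, now):
        job = self._jobs[job_id]
        if job.state is State.FINISHED:
            return None
        if job.state is State.QUEUED:
            self._queue.remove(job)
        return self._finish(job, now)
```

Dependency graphs, quotas, and reservations are deliberately absent, which matches the division of labor described above: anything beyond this is the client's job or an extension.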
>
>
>
> Interesting extension areas:
>
> * Additional scheduling policies
>
>     o Weighted fair-share, ...
>
>     o Multiple queues
>
>     o SLAs
>
>     o ...
>
> * Extended resource descriptions
>
>     o Additional resource types, such as GPUs
>
>     o Additional types of compute resources, such as desktop computers
>
>     o Condor-style class ads
>
> * Extended job descriptions (as returned to requesting clients and
> sys admins)
>
> * Additional classes of security credentials
>
> * Reservations separated from execution
>
>     o Enabling interactive and debugging jobs
>
>     o Support for multiple competing schedulers (incl. desktop cycle
>     stealing and market-based approaches to scheduling compute resources)
>
> * Ability to modify jobs during their existence
>
> * Fault tolerance
>
>     o Automatic rescheduling of jobs that failed due to system faults
>
>     o Highly available resources: This is partly a policy statement by
>     a scheduling service about its characteristics and partly the ability to
>     rebind clients to migrated service endpoints
>
> * Extended state transition diagrams and associated functionalities
>
>     o Job suspension
>
>     o Job migration
>
>     o ...
>
> * Accounting & quotas
>
> * Operating on arrays of jobs
>
> * Meta-schedulers, multiple schedulers, and ecologies and
> hierarchies of multiple schedulers
>
>     o Meta-schedulers
>
>         * Hierarchical job scheduling with a meta-scheduler as the only
>         entry point; forwarding jobs to the meta-scheduler from other
>         subsidiary schedulers
>
>     o Condor-style matchmaking
>
> * Directory services
>
>     o Using existing directory services
>
>     o Abstract directory service interface(s)
>
> * Data transfer topics
>
>     o Application data staging
>
>         * Naming
>
>         * Efficiency
>
>         * Convenience
>
>         * Cleanup
>
>     o Program staging/provisioning
>
>         * Description
>
>         * Installation
>
>         * Cleanup
>
>
>
>
>
> Marvin.
>
>
>
> ________________________________
>
> From: Ian Foster [mailto:foster at mcs.anl.gov]
> Sent: Monday, February 20, 2006 9:20 AM
> To: Marvin Theimer; ogsa-wg at ggf.org
> Cc: Marvin Theimer; Savas Parastatidis; Tony Hey; Marty Humphrey;
> gcf at grids.ucs.indiana.edu
> Subject: Re: [ogsa-wg] Paper proposing "evolutionary vertical design
> efforts"
>
>
>
> Dear All:
>
> The most important thing to understand at this point (IMHO) is the scope
> of this "HPC use case," as this will determine just how minimal we can
> be.
>
> I get the impression that the principal goal may be "job submission to a
> cluster." Is that correct? How do we start to circumscribe the scope
> more explicitly?
>
> Ian.
>
>
>
> At 05:45 AM 2/16/2006 -0800, Marvin Theimer wrote:
>
> Enclosed is a paper that advocates an additional set of activities that
> the authors believe that the OGSA working groups should engage in.
>
>
>
> Broadly speaking, the OGSA and related working groups are already doing
> a bunch of important things:
>
> * There is broad exploration of the big picture, including
> enumeration of use cases, taxonomy of areas, identification of research
> issues, etc.
>
> * There is work going on in each of the horizontal areas that
> have been identified, such as EMS, data services, etc.
>
> * There is work going on around individual specifications, such
> as BES, JSDL, etc.
>
>
>
> Given that individual specifications are beginning to come to fruition,
> the authors believe it is time to also start defining vertical
> profiles that precisely describe how groups of individual specifications
> should be employed to implement specific use cases in an interoperable
> manner. The authors also believe that the process of defining these
> profiles offers an opportunity to close the design loop by relating the
> various on-going protocol and standards efforts back to the use cases in
> a very concrete manner. This provides an end-to-end setting in which to
> identify holes and issues that might require additional protocols and/or
> (incremental) changes to existing protocols. The paper introduces both
> the general notion of doing focused vertical design efforts and then
> focuses on a specific vertical design effort, namely a minimal HPC
> design.
>
>
>
> The paper derives a specific HPC design in a first-principles manner
> since the authors believe that this increases the chances of identifying
> issues. As a consequence, existing specifications and the activities of
> existing working groups are not mentioned and this paper is not an
> attempt to actually define a specifications profile. Also, the absence
> of references to existing work is not meant to imply that such work is
> in any way irrelevant or inappropriate. The paper should be viewed as a
> first abstract attempt to propose a new kind of activity within OGSA.
> The expectation is that future open discussions and publications will
> explore the concrete details of such a proposal.
>
>
>
> This paper was recently sent to a few key individuals in order to get
> feedback from them before submitting it to the wider GGF community.
> Unfortunately that process took longer than intended and some members of
> the community may have already seen a copy of the paper without knowing
> the context within which it was written. This email should hopefully dispel
> any misconceptions that may have occurred.
>
>
>
> For those people who will be around for the F2F meetings on Friday,
> Marvin Theimer will be giving a talk on the contents of this paper at a
> time and place to be announced.
>
>
>
> Marvin Theimer, Savas Parastatidis, Tony Hey, Marty Humphrey, Geoffrey
> Fox
>
>
>
> _______________________________________________________________
> Ian Foster www.mcs.anl.gov/~foster
> Math & Computer Science Div. Dept of Computer Science
> Argonne National Laboratory The University of Chicago
> Argonne, IL 60439, U.S.A. Chicago, IL 60637, U.S.A.
> Tel: 630 252 4619 Fax: 630 252 1997
> Globus Alliance, www.globus.org <http://www.globus.org/>
>
>