[ogsa-wg] Paper proposing "evolutionary vertical design efforts"

Wed Mar 22 21:10:35 CST 2006

Hi;

You raise good questions -- all of which are reasonable things to try to
address in one or more extensions.

Concerning the high throughput community, my impression has been that
many (but by no means all) workloads consist of essentially idempotent
jobs, so that providing at-most-once semantics isn't that crucial since
running a job more than once by accident doesn't really hurt anything.
What the client wants is to get results back, one way or another, for
all the work items to be done.  I don't quite understand your reference
to statistical failure rates and would be interested to learn more about
what you mean.  It seems to me that a client will keep resubmitting jobs
until it gets answers back for all of the work items those jobs
represent, irrespective of random or non-random failure rates in job
processing.
Perhaps I'm misunderstanding the workloads you have in mind.
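
The resubmission pattern described above can be sketched roughly as
follows.  This is only an illustration under stated assumptions:
`run_until_complete`, `flaky_square`, the 30% failure rate and the
round limit are all hypothetical, and the jobs are assumed to be
idempotent so that running one twice is harmless.

```python
import random

# Hypothetical stand-in for a remote scheduler that sometimes loses jobs.
_rng = random.Random(0)

def flaky_square(item, failure_rate=0.3):
    """Pretend job: squares its input, but is 'lost' some of the time."""
    if _rng.random() < failure_rate:
        raise RuntimeError("job lost")
    return item * item

def run_until_complete(work_items, submit, max_rounds=100):
    """Resubmit every unfinished work item until each one has a result.

    Safe only because the jobs are assumed idempotent: a duplicate run
    wastes some cycles but does not change the answer.
    """
    results = {}
    pending = list(work_items)
    rounds = 0
    while pending and rounds < max_rounds:
        rounds += 1
        still_pending = []
        for item in pending:
            try:
                results[item] = submit(item)
            except RuntimeError:
                still_pending.append(item)  # try again next round
        pending = still_pending
    return results, pending

results, unfinished = run_until_complete(range(10), flaky_square)
```

Note that the loop never needs to know why a job failed or how often
failures occur; it only tracks which work items still lack a result.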

Thanks,
Marvin.

-----Original Message-----
From: owner-ogsa-wg at ggf.org [mailto:owner-ogsa-wg at ggf.org] On Behalf Of
Karl Czajkowski
Sent: Tuesday, March 21, 2006 8:16 PM
To: Marvin Theimer
Cc: Ian Foster; Marty Humphrey; Carl Kesselman; ogsa-wg at ggf.org
Subject: Re: [ogsa-wg] Paper proposing "evolutionary vertical design
efforts"

On Mar 21, Marvin Theimer modulated:

> I know that systems like LSF get used in high throughput settings where
> the service time for a job request is an issue....
> ...  If
> my assumption is correct, then this common use case in the HPC world
> may be one that many, if not most job schedulers would have a hard time
> supporting if they have to provide at-most-once transactional semantics
> for all job submissions.
> 

I do not think anyone claimed that at-most-once semantics should be
mandated on all requests.  Certainly nobody from Globus says
this... it is an optional feature of our job submission protocol, to
be chosen by the client depending on their needs.

I think the question is much more about whether (or how many times) an
optional at-most-once extension mechanism is defined.  Secondarily,
there is the question of efficiently determining if it (as an
extension) is available in a remote service.  A third interesting
question might be determining what the "cost" of the extension is
versus the cost of having lost jobs against an unknown remote service
implementation when setting up to do an extremely high throughput run
as you describe.
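
One common way to make at-most-once submission an opt-in feature is to
let the client attach an identifier it chose itself, which the service
uses to recognize retries.  The sketch below illustrates that general
idea only; it is not the GRAM protocol, and `JobService`, `submit` and
`submission_id` are hypothetical names.

```python
import uuid

class JobService:
    """Hypothetical in-memory job service with optional at-most-once mode."""

    def __init__(self):
        self._jobs_by_submission_id = {}
        self._next_job_id = 0
        self.started = 0  # how many jobs were actually started

    def _start_job(self, description):
        self._next_job_id += 1
        self.started += 1
        return self._next_job_id

    def submit(self, description, submission_id=None):
        # No ID supplied: plain submission, every call starts a new job.
        if submission_id is None:
            return self._start_job(description)
        # ID supplied: a retry with the same ID returns the original job
        # instead of starting a duplicate.
        if submission_id not in self._jobs_by_submission_id:
            job_id = self._start_job(description)
            self._jobs_by_submission_id[submission_id] = job_id
        return self._jobs_by_submission_id[submission_id]

service = JobService()
sid = str(uuid.uuid4())
first = service.submit("render frame 1", submission_id=sid)
retry = service.submit("render frame 1", submission_id=sid)  # same job
plain_a = service.submit("render frame 2")
plain_b = service.submit("render frame 2")  # duplicate job started
```

The "cost" in a real service would come from recording the identifier
durably before starting the job; the dictionary here sidesteps that
entirely, so the sketch says nothing about actual overhead.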

The high throughput case is interesting to me, because it is precisely
the user community that demanded efficient at-most-once semantics
from GRAM!  They are the ones who blast enough jobs through to notice
statistical failure rates and the cost of recovery.
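
As a rough illustration of why volume matters, assume each submission
is lost independently with some small probability.  The 0.1% figure
below is made up for the example, not a measured rate, and the function
names are hypothetical.

```python
def expected_lost(n_jobs, p_loss):
    """Expected number of lost submissions, assuming independent losses."""
    return n_jobs * p_loss

def prob_at_least_one_loss(n_jobs, p_loss):
    """Probability that at least one of n_jobs submissions is lost."""
    return 1.0 - (1.0 - p_loss) ** n_jobs

P_LOSS = 0.001  # illustrative assumption only

for n in (10, 1_000, 100_000):
    print(n, expected_lost(n, P_LOSS), prob_at_least_one_loss(n, P_LOSS))
```

Under these assumptions a ten-job run will usually see no losses at
all, while a run of a hundred thousand jobs should expect on the order
of a hundred, each of which has to be detected and recovered.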

karl

-- 
Karl Czajkowski
karlcz at univa.com