[ogsa-wg] Paper proposing "evolutionary vertical design efforts"

Marvin Theimer theimer at microsoft.com
Wed Mar 22 21:21:41 CST 2006


Hi;

One way or the other, if you want true transactional semantics for an
action spanning multiple sites that can each fail in a distributed
system then you will need to employ something like two-phase commit or
Paxos.

But we're rat-holing on an issue that doesn't matter.  Your last
paragraph is the important one and I agree with it.

Marvin.

-----Original Message-----
From: Karl Czajkowski [mailto:karlcz at univa.com] 
Sent: Tuesday, March 21, 2006 8:28 PM
To: Marvin Theimer
Cc: Christopher Smith; ogsa-wg at ggf.org; Ian Foster; Marty Humphrey; Carl Kesselman
Subject: Re: [ogsa-wg] Paper proposing "evolutionary vertical design efforts"

This could easily turn into a propeller-head discussion, but I
disagree that there is a two-phase commit implied.

This is really just a simple issue of reliable message queueing with a
single-phase commit.  The client is already "committed" to running the
job before he sends the first message!

The use of idempotent messaging provides for reliable hand-off of jobs
from client to scheduler without duplicates, with a bias toward assuming
success and only sending extra messages when some perturbing event has
occurred.

On the back end, I think schedulers already worry about at-most-once
execution among their resources, specifically when the application
isn't classed as "restartable".

In practice, what bites people with these big job flows is the
in-flight job that can be "orphaned" because some disruption obliterates
the positive acknowledgement that was returning the scheduler-issued
job ID, and there is no other correlation ID to use to go back and
determine what really happened.
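
To make the hand-off concrete, here is a minimal sketch in Python. It
is only an illustration under stated assumptions, not how any real
scheduler does it: the Scheduler class, its submit() and lookup()
methods, and the submit_reliably() helper are all hypothetical names.
The client picks a correlation ID before the first send and resends
the same message until it sees an acknowledgement; the scheduler
treats a repeat of a known ID as a no-op.

```python
import uuid


class Scheduler:
    """Toy in-memory scheduler (hypothetical) that deduplicates
    submissions on a client-supplied correlation ID."""

    def __init__(self):
        self._jobs = {}    # correlation_id -> scheduler-issued job ID
        self.started = 0   # number of jobs actually created

    def submit(self, correlation_id, job_spec):
        # A repeated submit with a known correlation ID creates nothing
        # new and returns the job ID issued the first time.
        if correlation_id not in self._jobs:
            self.started += 1
            self._jobs[correlation_id] = f"job-{self.started}"
        return self._jobs[correlation_id]

    def lookup(self, correlation_id):
        # Lets a client go back and ask what happened after a lost reply.
        return self._jobs.get(correlation_id)


def submit_reliably(scheduler, job_spec, lose_first_ack=False, max_tries=3):
    """Resend the same message until an acknowledgement arrives."""
    correlation_id = str(uuid.uuid4())  # chosen before the first send
    for attempt in range(max_tries):
        job_id = scheduler.submit(correlation_id, job_spec)
        if lose_first_ack and attempt == 0:
            continue  # simulate the acknowledgement being lost in transit
        return correlation_id, job_id
    raise RuntimeError("no acknowledgement received")
```

In the common case this costs one message and one reply; the resend
and the lookup() path only come into play after a disruption. A real
scheduler would presumably need to keep the ID table on stable
storage, which this sketch ignores.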

karl


On Mar 21, Marvin Theimer modulated:
> 
> Stating the question somewhat differently: does LSF write a log record
> to stable storage before running any given job request?  If so, then
> adding at-most-once semantics wouldn't be too hard.  Note, however,
> that exactly-once semantics would require a (distributed) two-phase
> commit to ensure that the log record accurately reflects whether or not
> the job actually got started on some (remote) compute resource.
> 
>  
> 
> Marvin.
> 


-- 
Karl Czajkowski
karlcz at univa.com




