[ogsa-wg] Paper proposing "evolutionary vertical design efforts"

Tue Mar 21 22:27:53 CST 2006

This could easily turn into a propeller-head discussion, but I
disagree that there is a two-phase commit implied.

This is really just a simple issue of reliable message queueing with a
single-phase commit.  The client is already "committed" to running the
job before he sends the first message!

Idempotent messaging provides a reliable hand-off of jobs from client
to scheduler without duplicates. It is biased toward assuming success,
and extra messages are needed only when some perturbing event has
occurred.
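
To make the hand-off concrete, here is a minimal sketch, assuming the
client picks its own submission ID and the scheduler remembers which
IDs it has already accepted. The names (Scheduler, submit,
submit_with_retry) are made up for illustration and are not any real
scheduler's API.

```python
class Scheduler:
    """Toy scheduler that deduplicates submissions by client-chosen ID."""

    def __init__(self):
        self.jobs = {}      # client_id -> scheduler-issued job ID
        self.counter = 0    # number of distinct jobs accepted

    def submit(self, client_id, spec):
        # A repeated submission returns the original job ID instead of
        # creating a second job, which is what makes a resend harmless.
        if client_id in self.jobs:
            return self.jobs[client_id]
        self.counter += 1
        job_id = f"job-{self.counter}"
        self.jobs[client_id] = job_id
        return job_id


def submit_with_retry(send, client_id, spec, max_attempts=5):
    """Resend the same message until an acknowledgement comes back.

    `send` is any callable that delivers (client_id, spec) to the
    scheduler and raises TimeoutError when no acknowledgement arrives.
    """
    for _ in range(max_attempts):
        try:
            return send(client_id, spec)
        except TimeoutError:
            continue  # ack may have been lost; resending is safe
    raise TimeoutError(f"no acknowledgement for {client_id}")
```

In the undisturbed case this costs one message and one
acknowledgement. The resend happens only when the acknowledgement goes
missing.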

On the back end, I think schedulers already worry about at-most-once
execution among their resources, specifically when the application
isn't classed as "restartable".

In practice, what bites people with these big job flows is the
in-flight job that gets "orphaned": some disruption obliterates the
positive acknowledgement that was returning the scheduler-issued job
ID, and there is no other correlation ID to use to go back and
determine what really happened.
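
As a rough sketch of the recovery path, assuming the scheduler indexes
its job records by a client-supplied correlation ID and lets the client
query by that ID, a client could reconcile its unacknowledged
submissions like this. JobTable and reconcile are hypothetical names,
not an existing interface.

```python
class JobTable:
    """Toy scheduler-side index from correlation ID to job record."""

    def __init__(self):
        self._by_corr = {}

    def record(self, corr_id, job_id, state):
        self._by_corr[corr_id] = (job_id, state)

    def lookup(self, corr_id):
        # Returns (job_id, state), or None if the scheduler never saw it.
        return self._by_corr.get(corr_id)


def reconcile(pending, lookup):
    """Sort unacknowledged correlation IDs into found and unknown.

    `pending` holds the correlation IDs the client sent but never got
    an acknowledgement for. `lookup` is the scheduler query.
    """
    found, unknown = {}, []
    for corr_id in pending:
        rec = lookup(corr_id)
        if rec is None:
            unknown.append(corr_id)   # request never arrived
        else:
            found[corr_id] = rec      # job exists; only the ack was lost
    return found, unknown
```

Jobs in the "found" set are the would-be orphans, now recovered with
their scheduler-issued IDs. Those in the "unknown" set were never
received and can be submitted again.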

karl

On Mar 21, Marvin Theimer modulated:
> 
> Stating the question somewhat differently: does LSF write a log record
> to stable storage before running any given job request?  If so, then
> adding at-most-once semantics wouldn’t be too hard.  Note, however,
> that exactly-once semantics would require a (distributed) two-phase
> commit to ensure that the log record accurately reflects whether or not
> the job actually got started on some (remote) compute resource.
> 
> Marvin.
> 

-- 
Karl Czajkowski
karlcz at univa.com