[graap-wg] A Highly available, Fault tolerant Co-scheduling System

Mon Oct 10 10:58:19 CDT 2005

Karl,

Thanks for the email.  I was sorry that you weren't at the presentation.

I've replied to stuff inline below.  However, a couple of general  
points/observations.

First, what would I gain from using WS-Agreement in the way you  
propose?  At the moment, we have a nice co-allocation scheme, where  
the co-allocators don't need to know anything about the payload of  
the message - it can even be encrypted.  (An important separation, in  
my opinion.)  Also, my scheme currently uses XML over HTTP.  It could  
use WS-I, if we wanted to add SOAP.  But it's just XML messages,  
essentially.  If WS-Agreement were "merged" with this scheme, I'd then  
need to use WS-RF, which is not acceptable to everyone.  (In any case,  
my impression from reading your proposal below is that using  
WS-Agreement in this way feels like a bit of a hack.)

Also, below, you are talking about using WS-Agreement as the protocol  
between the entity doing the co-scheduling and the resource managers  
(RMs).  I'm not envisaging the user doing co-scheduling directly -  
it's complex, and I'd rather encapsulate it, as in my  
implementation.  If you had a service doing this for you, your scheme  
would need two levels of WS-Agreement: between the user and the  
co-scheduler, and between the co-scheduler and the RMs.  Is this what  
you imagine, or  
do you think the user should take this on directly?

On Oct 8, 2005, at 1:36 AM, Karl Czajkowski wrote:
> On Oct 07, Jon MacLaren modulated:
> ...
>
>> The system uses the Paxos Commit protocol (Lamport, Gray) to overcome
>> the problems associated with distributed 2-phase commit.
>>
>
> Jon:
>
> As no doubt you'll remember, it has been proposed that advance
> reservation is an approach to distributed co-allocation.
> Specifically, advance reservation agreements can be seen as the
> "prepare" step in the 2PC protocol and the subsequent claiming
> agreements can be seen as the "commit" step.  As such, we can envision
> WS-Agreement being used in the protocol between the 2PC transaction
> manager and the resources (as well as between the initiator and the
> transaction manager).

I remember.  As I pointed out, this hides the nature of what is going  
on (co-allocation) from the resource manager (RM).  In my  
implementation, the RM is aware of the difference between a  
prepare->prepared->abort sequence and a normal reserve followed by  
cancellation.  Hiding this difference, as you propose, has  
implications for charging schemes, allocation quotas, etc. - it is, I  
believe, too restrictive.

> Anyway, I read up on Paxos a bit, and as far as I can tell it has
> these same underlying mechanisms of prepare/commit at the individual
> resources.  In essence, it is a way of making distributed transaction
> managers as a group consensus on top of the same basic parties: one
> who initiates the transaction and N who participate in it. It adds 2F
> additional processes in between the initiator and resources to
> tolerate F process failures.  Having actually studied and implemented
> such a system, do you think this is an accurate summary?

Basically that's correct.  Only a majority of acceptors have to  
remain operable.  There are some other nice properties too though:
1. All messages are idempotent (multiple delivery is OK)
2. Messages do not have to be reliable or arrive in a timely manner
There are other "goodies" too - others looking at this should chase  
down the references (see my coscheduling web page).

There is one instance of Paxos Consensus for each RM decision  
(prepared/aborted).
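
As a rough illustration of the two points above (idempotent messages, and one consensus instance per RM decision), here is a minimal Python sketch.  It is not the actual implementation: the `Acceptor` class, `chosen` and `outcome` are hypothetical names, phase 1 of Paxos is omitted, and it assumes the transaction commits only if every RM's instance chooses "prepared".

```python
from collections import Counter

PREPARED, ABORTED = "prepared", "aborted"


class Acceptor:
    """Hypothetical acceptor holding one Paxos instance per RM (phase 1 omitted)."""

    def __init__(self):
        self.accepted = {}  # rm_id -> (ballot, value)

    def on_accept(self, rm_id, ballot, value):
        # Idempotent: redelivering the same (ballot, value) changes nothing,
        # and a stale lower-ballot message is ignored.
        current = self.accepted.get(rm_id)
        if current is None or ballot > current[0]:
            self.accepted[rm_id] = (ballot, value)


def chosen(acceptors, rm_id):
    """Value accepted by a majority of acceptors in the same ballot, else None."""
    votes = Counter(a.accepted[rm_id] for a in acceptors if rm_id in a.accepted)
    for (_ballot, value), count in votes.items():
        if count > len(acceptors) // 2:
            return value
    return None


def outcome(acceptors, rm_ids):
    """'commit' if every RM instance chose prepared, 'abort' if any chose aborted."""
    decisions = [chosen(acceptors, rm_id) for rm_id in rm_ids]
    if ABORTED in decisions:
        return "abort"
    if all(d == PREPARED for d in decisions):
        return "commit"
    return None  # still undecided


acceptors = [Acceptor() for _ in range(5)]
for a in acceptors[:3]:
    a.on_accept("rm1", 0, PREPARED)
    a.on_accept("rm1", 0, PREPARED)  # duplicate delivery is harmless
```

Here three of the five acceptors are enough for "rm1" to be decided; the other two can be down or unreachable.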

> Is there anything you can identify that is missing from
> WS-Agreement that would allow it to be used at each resource in the
> Paxos Commit protocol in the same manner that we have intended it to
> be used in the "prepare" and "commit" steps of the 2PC protocol?
> E.g. two separate agreements at each resource to represent the two
> phases?

Yes.  If you look again at the "Consensus on Transaction Commit"  
paper, you'll see that the RMs have to report their prepared/aborted  
decision to all the acceptors.  WS-Agreement does not do this.  (It  
could work with classic 2-phase commit, with a single transaction  
manager, but neither of us is interested in this.)

> My understanding is that there needs to be a way to name the agreement
> such that each of the Paxos processes can find the same answer to the
> "prepare" step at each resource.  Can Paxos elect a "leader" who
> initiates the prepare step, e.g. CreateAgreement, so that the others
> can just check the result status, e.g. RP query?  Or would a truly
> idempotent CreateAgreement process be required so that any process can
> initiate the prepare step and all will learn the same result using the
> same message pattern, regardless of which contacts the resource first?
> By the way, I think this latter behavior could be solved at the WS
> binding level, using the current WS-Agreement definitions.  This
> would be a different application of the same idempotent-submit
> mechanism we use in WS-GRAM for simple reliability...

The initial leader *does* initiate the prepare (see the paper  
again).  However, you don't know if that message got through.   
Prepare is not resent to the RMs.  Instead, if the response does not  
arrive within a given time, the leader will ask for another ballot in  
the RM's instance of Paxos.  If the ballot is "free", i.e. no  
acceptor has seen a response from the RM, then the leader will  
propose the value "aborted" for this round of Paxos.
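
The leader's choice after a timeout can be sketched roughly as follows.  This is a hedged illustration in Python, not the real code: `leader_on_timeout` and the shape of `promises` are assumptions.  The "free ballot means propose aborted" case is the one described above; the other branch is the standard Paxos rule of re-proposing the value from the highest-numbered ballot that any acceptor reports having accepted.

```python
def leader_on_timeout(promises):
    """Pick the value to propose in a new ballot of one RM's Paxos instance.

    `promises` (hypothetical shape) holds the replies from a majority of
    acceptors: each entry is None if that acceptor has accepted nothing for
    this RM, or a (ballot, value) pair it accepted earlier.
    """
    seen = [p for p in promises if p is not None]
    if not seen:
        # The ballot is "free": no acceptor has seen a response from the RM,
        # so the leader proposes "aborted".
        return "aborted"
    # Otherwise the leader must stick with the value from the highest ballot.
    return max(seen, key=lambda p: p[0])[1]
```

So a late "prepared" that already reached some acceptor is not overridden; only a genuinely unanswered prepare turns into "aborted".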

You can't rely on the creation of the RP thing in order to discover  
the decision later on.  What if the RM is down?

> Of course, this use of agreements for the phases requires a certain
> set of additional assumptions about how deterministic the claim step
> is, once a reservation is held; otherwise, the semantics of the
> "prepare" step (and the whole transaction) becomes wishy-washy.
> Particularly, if the reservation agreements are constrained in time
> (e.g. a typical wall-clock advance reservation scenario), the commit
> protocol can be violated because the preparation can expire before the
> commit phase is completed (violating the ACID properties). As I
> understand it, Paxos can reduce the likelihood of delays due to
> transaction manager failure, but arbitrary delay is still a hazard
> with realistic messaging models, i.e. Internet-based services, because
> of unbounded message delay/loss to the distributed resources that are
> being coordinated.

This is true.  However, Paxos handles message delays/non-arrival by  
having subsequent ballots.  It recovers automatically from this - it  
doesn't just block.  So individual messages being delayed is not a  
problem.  For Paxos not to make progress, you need to engineer a  
situation where there is no majority of acceptors still working.   
What do you think the chances are of messages being systematically  
delayed between a number of processes?

If you crunch the numbers on all these failures (I used an example of  
acceptors being inoperable for one hour out of 24 hours), you find  
that the likelihood of a 5-acceptor Paxos round blocking is very,  
very small (once in a number of years).
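
One simple way to start crunching such numbers is sketched below.  It assumes the acceptors fail independently and that each is down 1/24 of the time (the one-hour-in-24 example); `p_no_majority` is a hypothetical helper.  It only gives the probability that, at a given instant, no majority of acceptors is working.  It does not attempt to reproduce the "once in a number of years" figure, which also depends on how long a round takes and how often rounds are run.

```python
from math import comb


def p_no_majority(n_acceptors, p_down):
    """Probability that fewer than a majority of acceptors are up,
    assuming each is independently down with probability p_down."""
    majority = n_acceptors // 2 + 1
    min_down_to_block = n_acceptors - majority + 1
    return sum(
        comb(n_acceptors, k) * p_down**k * (1 - p_down) ** (n_acceptors - k)
        for k in range(min_down_to_block, n_acceptors + 1)
    )


# Five acceptors, each inoperable one hour in 24: roughly 0.00068.
print(p_no_majority(5, 1 / 24))
```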

That's good enough for me.

Jon.

> Thoughts?
>
> karl
