[graap-wg] A Highly available, Fault tolerant Co-scheduling System

Jon MacLaren maclaren at cct.lsu.edu
Fri Oct 7 07:05:38 CDT 2005


For those of you not in Boston this week: I gave an overview of a
highly available, fault-tolerant co-scheduling system that I've
built at LSU.  We successfully used this system at the iGrid 2005
meeting to co-schedule 10 compute jobs and 2 Calient DiamondWave
switches.

The system uses the Paxos Commit protocol (Gray and Lamport) to avoid
the main problem with distributed two-phase commit, namely that the
protocol blocks if the coordinator fails.  The co-scheduler, when
deployed as 5 processes, has a mean time to failure of about 12 years.
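
To make the quorum idea concrete, here is a minimal sketch of the commit
rule in Paxos Commit.  It is not the co-scheduler's code.  It assumes the
5 processes act as acceptors, it ignores ballot numbers and leader
election, and every name in it (quorum_size, chosen_value, decide, the
"prepared"/"aborted" strings) is made up for illustration.  Each resource
gets its own consensus instance; the transaction commits only if every
instance has "prepared" accepted by a majority of acceptors.

```python
from typing import Dict, List, Optional


def quorum_size(num_acceptors: int) -> int:
    """Smallest majority of the acceptors (F + 1 out of 2F + 1)."""
    return num_acceptors // 2 + 1


def tolerated_failures(num_acceptors: int) -> int:
    """Number of acceptors that can fail while a majority remains."""
    return (num_acceptors - 1) // 2


def chosen_value(votes: List[Optional[str]]) -> Optional[str]:
    """Return the value accepted by a majority, or None if there is none.

    votes[i] is what acceptor i reports for one resource's instance;
    None means that acceptor is down or has not answered.
    """
    need = quorum_size(len(votes))
    for value in ("prepared", "aborted"):
        if votes.count(value) >= need:
            return value
    return None


def decide(instances: Dict[str, List[Optional[str]]]) -> str:
    """Commit only if every resource's instance chose 'prepared'."""
    outcomes = [chosen_value(votes) for votes in instances.values()]
    if any(outcome == "aborted" for outcome in outcomes):
        return "abort"
    if all(outcome == "prepared" for outcome in outcomes):
        return "commit"
    return "undecided"


# Hypothetical run: 5 acceptors, two of them unreachable.
example = {
    "compute-job": ["prepared", "prepared", "prepared", None, None],
    "switch": ["prepared", None, "prepared", "prepared", None],
}
print(quorum_size(5), tolerated_failures(5), decide(example))
```

Under these assumptions, 5 acceptors give a quorum of 3, so any 2 can be
down and the reservation can still be decided.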

The software is available for download but, as I said at the
meeting, I'm still writing up the installation instructions.  There
is a mailing list about the work, though: if you go to the web site,
you can sign up and receive updates.

The link is:  http://www.cct.lsu.edu/personal/maclaren/CoSched/

Cheers,

Jon MacLaren.




