[graap-wg] A Highly available, Fault tolerant Co-scheduling System
Jon MacLaren
maclaren at cct.lsu.edu
Fri Oct 7 07:05:38 CDT 2005
For those of you who were not in Boston this week: I gave an overview
of a highly available, fault-tolerant co-scheduling system that I've
built at LSU. We successfully used this system at the iGrid 2005
meeting to co-schedule 10 compute jobs and 2 Calient DiamondWave
switches.
The system uses the Paxos Commit protocol (Lamport, Gray) to overcome
the problems associated with distributed 2-phase commit, in
particular that the failure of the single coordinator can leave the
participants blocked. The co-scheduler, when deployed as 5 processes,
has a mean-time-to-failure of about 12 years.
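
To make the commit rule concrete, here is a minimal sketch of how an
outcome could be derived in a Paxos Commit style scheme. It is not the
code of the co-scheduler. It assumes the 5 processes act as acceptors,
so that a majority is 3 and up to 2 failures are tolerated, and it
models a single ballot only. The names (decide, chosen_value, the
resource labels) are hypothetical.

```python
from collections import Counter

PREPARED, ABORTED = "prepared", "aborted"


def quorum_size(n_acceptors):
    """Smallest majority of the acceptors (3 out of 5)."""
    return n_acceptors // 2 + 1


def chosen_value(acks, n_acceptors):
    """Return the value accepted by a majority of acceptors, else None.

    `acks` lists the values reported by the acceptors that could be
    reached; acceptors that have failed simply do not appear.
    """
    if not acks:
        return None
    value, count = Counter(acks).most_common(1)[0]
    return value if count >= quorum_size(n_acceptors) else None


def decide(acks_by_resource, n_acceptors=5):
    """Commit only if every resource's vote was chosen as PREPARED."""
    outcomes = [chosen_value(acks, n_acceptors)
                for acks in acks_by_resource.values()]
    if not outcomes:
        return "undecided"
    if ABORTED in outcomes:
        return "abort"
    if all(outcome == PREPARED for outcome in outcomes):
        return "commit"
    # Some vote has not reached a majority yet (assumption: a real
    # system would retry or force an abort in a later ballot).
    return "undecided"
```

For example, with two of the five acceptors down, calling
decide({"job1": ["prepared"] * 3, "switch1": ["prepared"] * 3}) still
yields "commit", because each vote is held by a majority.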
The software is available for download but, as I said at the
meeting, I'm still writing up the installation instructions. There
is a mailing list about the work, though: if you go to the website,
you can sign up and receive updates.
The link is: http://www.cct.lsu.edu/personal/maclaren/CoSched/
Cheers,
Jon MacLaren.