Lodewijk andré de la porte <l@odewijk.nl> writes:
>So how do they do that? If there's power failure on a specific box, what happens? Are all transactions synced to disk before commit, thus minimal rollbacks? A minimal rollback takes a very small margin of what would happen in case of power failure on a box. Maybe they have several boxes advocating a single transaction, so that expected failures would never crash a system completely.
This was a software guy (quoting what he knew about some of the special hardware features), so he didn't go into that much detail on this sort of thing, but in any case it's a problem that's been (mostly) solved for decades. Just look for discussions of high-availability systems (https://archive.org/details/reliablecomputer00siew is one good starting point). It's not for nothing that, for example, Tandems are sold under the name NonStop (they're covered in a case study in the book referenced above).

I was in a Tandem shop some years ago when it experienced a rapid sequence of power glitches. The mass of IT gear in the building needed everything from a reboot to a reinstall to hardware replacement to get working again. One of their techies took me into the mainframe room to the Tandem console, which showed a series of reports "Power lost / Power restored / Power lost / ...". Apart from that there had been no effect.

There's a story that during the Loma Prieta earthquake a data centre containing a Tandem machine was damaged. The machine continued running, lying on its side surrounded by debris, until they could bring in heavy equipment to push it upright again.

Peter.