"Whew, wondered where we'd put those 200,000 BTC!"

Mon Mar 24 18:11:10 PDT 2014

Lodewijk andré de la porte <l at odewijk.nl> writes:

>So how do they do that? If there's power failure on a specific box, what
>happens? Are all transactions synced to disk before commit, thus minimal
>rollbacks? A minimal rollback takes a very small margin of what would happen
>in case of power failure on a box. Maybe they have several boxes advocating a
>single transaction, so that expectible failures would never crash a system
>completely.

This was a software guy (quoting what he knew about some of the special
hardware features), so he didn't go into that much detail on this sort of
thing, but in any case it's a problem that's been (mostly) solved for decades.
Just look for discussions of high-availability systems
(https://archive.org/details/reliablecomputer00siew is one good starting
point).

It's not for nothing that, for example, Tandems are sold under the name
NonStop (they're covered in a case study in the book referenced above).  I was
in a Tandem shop some years ago when it experienced a rapid sequence of power
glitches.  The mass of IT gear in the building needed everything from a reboot
to a reinstall to hardware replacement to get working again.  One of their
techies took me into the mainframe room to the Tandem console, which had a
series of reports "Power lost / Power restored / Power lost / ...".  Apart
from that there had been no effect.

There's a story that during the Loma Prieta earthquake a data centre
containing a Tandem machine was damaged.  The machine continued running,
lying on its side surrounded by debris, until they could bring in heavy
equipment to push it upright again.

Peter.