updated text of the MTTF paper

17 Dec 2003

      What I sent out was dashed together last June and never polished.
Here's text after one re-write...

 - Carl

(Equations haven't changed...)

--------------------------------------------------------------------

\documentstyle[12pt]{article}

\begin{document}

\title{MTTF of Various Systems}
\author{Carl M. Ellison \thanks{Stratus Computer Inc., 55 Fairbanks Blvd.,
Marlborough MA 01752. Email address: {\tt cme@sw.stratus.com}.}}
\date{October 25, 1993}

\maketitle

\begin{abstract}
Expressions are presented for the Mean Time To Failure (MTTF) of various
redundant systems, as a function of the number of nodes in the system, N,
and the minimum number of nodes in a working system, K.  Failure of a
system is defined as having fewer than K working nodes.
\end{abstract}

\section{Expressions} 
Redundant systems are employed both to increase availability and to achieve
preservation of data.

Expressions for the availability of a redundant system are to be found in
normal probability and statistics texts.  [See for example, Kishor S.
Trivedi, "Probability and Statistics with Reliability, Queuing, and
Computer Science Applications" from Prentice-Hall, 1982.]  These assume
that once a system has recovered from total failure, it is as usable as it
was before the system failure.

Mean Time To Failure (MTTF) is concerned with the case that once a system
has achieved total failure, something is lost.  This might apply to
redundant disks, for example.

The expressions presented below are for the MTTF of various redundant
systems, as a function of the number of nodes in the system, N, and the
minimum number of nodes in a working system, K.  Failure of a system is
defined as having fewer than K working nodes.  Often, K=1 and each node has
a complete copy of each database.  However, sometimes the data can be kept
on multple nodes (as in a RAID-5 disk array) which will tolerate some
failures, down to a given threshold, K > 1.

It is assumed that as soon as a failure occurs, a repair cycle will be
started.  There is then a race to see if the repair can be completed before
enough additional nodes fail to drop the working number below K.

The expressions below were derived from custom Markov chains, each built to
model a given choice of N and K.  It is assumed that both failures and
repairs are exponentially distributed random events (so that the Markov
chains remain memoryless).  This is a reasonable model for failures but
not for repairs.  Therefore, these expressions are approximations.

The expressions also assume that all failure events are independent.  For
example, a multi-node system in which all nodes are on the same power grid
would not have completely independent failure events.

N: number of nodes in a full system

K: number of nodes in a minimally functional system

$\lambda$: rate of failures (e.g., number of node failures per year)

$\mu$: rate of node repair (in the same units as $\lambda$)

Each fraction below is the MTTF of the whole system: the mean time until a
system drops to (K-1) working nodes.

N =  2 ;  K =  1

\begin{equation}
\frac { 3\lambda + 1\mu }{ 2\lambda^2 }
\end{equation}

N =  3 ;  K =  1

\begin{equation}
\frac { 11\lambda^2 + 4\lambda\mu + 1\mu^2 }{ 6\lambda^3 }
\end{equation}

N =  3 ;  K =  2

\begin{equation}
\frac{ 5\lambda + \mu }{ 6\lambda^2 }
\end{equation}

N =  4 ;  K =  1

\begin{equation}
\frac { 50\lambda^3 + 18\lambda^2\mu + 5\lambda\mu^2 + 1\mu^3 }{ 24\lambda^4 }
\end{equation}

[etc.]

\end{document}

cme＠ellisun.sw.stratus.com

tags

participants (1)