updated text of the MTTF paper
What I sent out was dashed together last June and never polished. Here's text after one re-write... - Carl (Equations haven't changed...) -------------------------------------------------------------------- \documentstyle[12pt]{article} \begin{document} \title{MTTF of Various Systems} \author{Carl M. Ellison \thanks{Stratus Computer Inc., 55 Fairbanks Blvd., Marlborough MA 01752. Email address: {\tt cme@sw.stratus.com}.}} \date{October 25, 1993} \maketitle \begin{abstract} Expressions are presented for the Mean Time To Failure (MTTF) of various redundant systems, as a function of the number of nodes in the system, N, and the minimum number of nodes in a working system, K. Failure of a system is defined as having fewer than K working nodes. \end{abstract} \section{Expressions} Redundant systems are employed both to increase availability and to achieve preservation of data. Expressions for the availability of a redundant system are to be found in normal probability and statistics texts. [See for example, Kishor S. Trivedi, "Probability and Statistics with Reliability, Queuing, and Computer Science Applications" from Prentice-Hall, 1982.] These assume that once a system has recovered from total failure, it is as usable as it was before the system failure. Mean Time To Failure (MTTF) is concerned with the case that once a system has achieved total failure, something is lost. This might apply to redundant disks, for example. The expressions presented below are for the MTTF of various redundant systems, as a function of the number of nodes in the system, N, and the minimum number of nodes in a working system, K. Failure of a system is defined as having fewer than K working nodes. Often, K=1 and each node has a complete copy of each database. However, sometimes the data can be kept on multple nodes (as in a RAID-5 disk array) which will tolerate some failures, down to a given threshold, K > 1. It is assumed that as soon as a failure occurs, a repair cycle will be started. There is then a race to see if the repair can be completed before enough additional nodes fail to drop the working number below K. The expressions below were derived from custom Markov chains, each built to model a given choice of N and K. It is assumed that both failures and repairs are exponentially distributed random events (so that the Markov chains remain memoryless). This is a reasonable model for failures but not for repairs. Therefore, these expressions are approximations. The expressions also assume that all failure events are independent. For example, a multi-node system in which all nodes are on the same power grid would not have completely independent failure events. N: number of nodes in a full system K: number of nodes in a minimally functional system $\lambda$: rate of failures (e.g., number of node failures per year) $\mu$: rate of node repair (in the same units as $\lambda$) Each fraction below is the MTTF of the whole system: the mean time until a system drops to (K-1) working nodes. N = 2 ; K = 1 \begin{equation} \frac { 3\lambda + 1\mu }{ 2\lambda^2 } \end{equation} N = 3 ; K = 1 \begin{equation} \frac { 11\lambda^2 + 4\lambda\mu + 1\mu^2 }{ 6\lambda^3 } \end{equation} N = 3 ; K = 2 \begin{equation} \frac{ 5\lambda + \mu }{ 6\lambda^2 } \end{equation} N = 4 ; K = 1 \begin{equation} \frac { 50\lambda^3 + 18\lambda^2\mu + 5\lambda\mu^2 + 1\mu^3 }{ 24\lambda^4 } \end{equation} [etc.] \end{document}
participants (1)
-
cme@ellisun.sw.stratus.com