On building systems that will fail
source link: https://fermatslibrary.com/s/on-building-systems-that-will-fail
To Fernando Corbató
for work in organizing the concepts and leading the development of the general-purpose large-scale time-sharing and resource-sharing computer systems CTSS and MULTICS
FERNANDO J. CORBATÓ
It is an honor and a pleasure to accept the Alan Turing Award. My own work has been on computer systems, and that will be my theme.
The essence of systems is that they are integrating efforts, requiring broad knowledge of the problem area to be addressed, and the detailed knowledge required is rarely held by one person. Thus the work of systems is usually done by teams. Hence I am accepting this award on behalf of the many with whom I have worked as much as for myself. It is not practical to name all the individuals who contributed.
Nevertheless, I would like to give special mention to Marjorie Daggett and Bob Daley for their parts in the birth of CTSS, and to Bob Fano and the late Ted Glaser for their critical contributions to the development of the Multics System.
Let me turn now to the title of this talk: "On Building Systems That Will Fail." Of course the title I chose was a teaser. I considered and discarded some alternate titles: "On Building Messy Systems," but it seemed too frivolous and suggests there is no systematic approach. "On Mastering System Complexity" sounded like I have all the answers. The title that came closest, "On Building Systems That Are Likely to Have Failures," did not have the nuance of inevitability that I wanted to suggest.
What I am really trying to address is the class of systems that, for want of a better phrase, I will call "ambitious systems." It almost goes without saying that ambitious systems never quite work as expected. Things usually go wrong, sometimes in dramatic ways. And this leads me to my main thesis, namely, that the question to ask when designing such systems is not if something will go wrong, but when it will go wrong.
Some Examples
Now, ambitious systems that fail are really much more common than we may realize. In fact, in some circumstances we strive for them, revelling in the excitement of the unexpected. For example, let me remind you of our national sport of football. The whole object of the game is for each team to play at the limit of its abilities. Besides the sheer physical skill required, one has the strategic intricacies, the ability to audibilize, and the quickness to react to the unexpected, all a deep part of the game. Of course, occasionally one team approaches perfection, all the plays work, and the game becomes dull.
Another example of a system that is too ambitious for perfection is military warfare. The same elements are there, with opposing sides having to constantly improvise and deal with the unexpected. In fact, we get from the military that wonderful acronym, SNAFU, which is politely translated as "situation normal, all fouled up." And if any of you are still doubtful, consider how rapidly the phrases "precision bombing" and "surgical strikes" are replaced by "the fog of war" and "casualties from friendly fire" as soon as hostilities begin.
On a somewhat more whimsical note, let me offer driving in Boston as an example of systems that will fail. Automobile traffic is an excellent case of distributed control with a common set of protocols called traffic regulations. The Boston area is notorious for the free interpretations drivers make of these pesky regulations, and perhaps the epitome of it occurs in the arena of the traffic rotary. A case can be made for rotaries. They are efficient. There is no need to wait for sluggish traffic signals. They are direct. And they offer great opportunities for creative improvisation, thereby adding zest to the sport of driving.
One of the most effective strategies is for a driver approaching a rotary to rigidly fix his or her head, staring forward, of course, while secretly using peripheral vision to the limit. It is even more effective if the driver, on entering the rotary, speeds up, and some drivers embellish this last step by adopting a look of maniacal glee. The effect is, of course, one of intimidation, and a pecking order quickly develops. The only reason there are not more accidents is that most drivers have a second component to the strategy, namely, to assume everyone else may be crazy (they are often correct), and every driver is really prepared to stop with inches to spare. Here we see an example of a system where ambitious tactics and prudent caution lead to an effective solution.
So far, the examples I have given may suggest that failures of ambitious systems come from the human element and that at least the technical parts of the system can be built correctly. In particular, turning to computer systems, it is only a matter of getting the code debugged. Some assume rigorous testing will do the job. Some put their hopes in proving program correctness. But unfortunately, there are many cases for which none of these techniques will always work [1]. Let me offer a modest example, illustrated in Figure 1.
Consider the case of an elaborate numerical calculation with a variable, f, representing some physical value, being calculated for a set of points over a range of a parameter, t. Now, the property of physical variables is that they normally do not exhibit abrupt changes or discontinuities. Yet the curve computed for f in Figure 1 shows exactly such a jump.
So what has happened here? If we look at the expression for f, we see it is the result of a constant, k, added to the product of two other functions, g and h. Looking further, we see that the function g has a behavior that is exponentially increasing with t. The function h, on the other hand, is exponentially decreasing with t. The resultant product of g and h, exp((a-b)t), is almost constant with increasing t, until an abrupt jump occurs and the curve for f goes flat.
What has gone wrong? The answer is that there has been floating-point underflow at the critical point in the curve, i.e., the representation of the negative exponent has exceeded the field size in the floating-point representation for this particular computer, and the hardware has automatically set the value of the function h to zero.
[Figure 1. A Subtle Bug: the curve of f against t, where f(t) = k + g(t)·h(t), with g(t) = exp(at), a > 0, and h(t) = exp(-bt), b > 0.]
[Figure 2. Why Mishaps? Performance in MIPS, on a logarithmic scale from 0.1 to 100, plotted by year from 1950 to 1990.]
Often this is reasonable, since small numbers are correctly approximated by zero, but not in this case, where our results are grossly wrong. Worse yet, since the computation of f might be internal, it is easy to imagine that the failure shown here would not be noticed.
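To make the failure concrete, here is a minimal Python sketch of the same kind of computation. The constants a, b, and k and the sample points are hypothetical, chosen only so that h underflows while g is still representable; they are not the values behind the original figure.

```python
import math

# Hypothetical constants: g grows as exp(a*t), h decays as exp(-b*t).
# Mathematically f(t) = k + exp((a-b)*t), which is small here but never
# exactly equal to k.
a, b, k = 1.0, 1.25, 1e-70

def f(t):
    g = math.exp(a * t)    # exponentially increasing; still representable here
    h = math.exp(-b * t)   # exponentially decreasing; underflows silently to 0.0
    return k + g * h

for t in (560, 580, 590, 596, 597, 600):
    print(t, f(t))
```

For the early sample points the printed values decay smoothly, but once -b*t falls below roughly -745 (the underflow limit of IEEE double precision), h becomes exactly 0.0, f collapses to exactly k, and the curve goes flat, all without any exception or warning.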
Because correctly handling the pathology that this example represents is an extra engineering bother, it should not be surprising that the problem of underflow is frequently ignored. But the larger lesson to be learned from this example is that subtle mistakes are very difficult to avoid and, to some extent, are inevitable.
I encountered my next example when I was a graduate student programming on the pioneering Whirlwind computer. One night while awaiting my turn to use it, the graduate student before me began complaining of how "tough" some of his calculations were. He said he was computing the vibrational frequencies of a particular wing structure for a series of cases. In fact, his equations were cubics, and he was using the iterative Newton-Raphson method. For reasons he did not understand, his method was finding one of the roots, but not "converging" for the others. He was trying to fix this situation by changing his program so that when he encountered one of these tough roots, the program would abandon the iteration.
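His iteration is easy to reconstruct in outline. The following Python sketch is my reconstruction, using a made-up cubic rather than his wing-structure equations; like his proposed fix, it simply abandons the iteration after a fixed number of tries.

```python
def newton_raphson_cubic(a, b, c, d, x0, tol=1e-12, max_iter=50):
    """Seek a root of a*x^3 + b*x^2 + c*x + d = 0, starting from x0."""
    x = x0
    for _ in range(max_iter):
        fx = ((a * x + b) * x + c) * x + d    # evaluate cubic (Horner's rule)
        dfx = (3 * a * x + 2 * b) * x + c     # evaluate its derivative
        if dfx == 0.0:
            return None                       # flat spot: iteration cannot proceed
        x_next = x - fx / dfx
        if abs(x_next - x) < tol:
            return x_next                     # converged
        x = x_next
    return None                               # abandon the "tough" root

# x^3 - x^2 + x - 1 = (x - 1)(x^2 + 1): one real root at x = 1. Starting
# from a real guess, the iteration stays on the real line, so the two
# complex roots are unreachable no matter how many tries are allowed.
print(newton_raphson_cubic(1.0, -1.0, 1.0, -1.0, x0=3.0))
```

One classic reason such an iteration fails to "converge" is visible in the sketch: a cubic can have complex roots, and a Newton-Raphson iteration confined to real arithmetic can never reach them, however long it runs.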