Building Secure and Reliable Network Applications

Kenneth P. Birman

1996 | 591 pages
ISBN: 1884777295

Hardbound, $58.00 (out of print)

DESCRIPTION

As the "network is the computer" slogan becomes reality so reliability and security of networked applications become more important. Not only are hospitals, air traffic control systems, and telephone networks becoming more networked, but business applications are increasingly based on the open world of the Internet. Stability in the face of minor accidents, software or hardware failures, or outright attack has become vital. This book provides a structured approach to the technologies currently available for building reliable solutions to these problems.

Building Secure and Reliable Network Applications reviews the most important network technologies from a security and reliability perspective and discusses the most effective solutions with an eye towards their application to real-world systems. Any computing professional who works with networked software will find this book valuable in understanding security and reliability vulnerabilities and how to address them.

WHAT THE EXPERTS SAY ABOUT THIS BOOK...

"... a must read for anyone wishing to know the state of the art in reliability ."
--Dalia Malki, AT&T Labs

"... tackles the difficult problem of building reliable distributed computing systems in a way that not only presents the principles but also describes proven practical solutions."
--John Warne, BNR Europe

ABOUT THE AUTHOR...

Ken Birman is an authority on reliable and secure distributed computing and the lead developer of the ISIS system used by over 300 companies worldwide. A Professor of Computer Science at Cornell University, he is also Editor-in-Chief of ACM Transactions on Computer Systems.

Sample Chapters

Two sample chapters are available for download.

User's Guide

This book was written with several types of readers in mind, and consequently it weaves together material that will be of greater interest to some types of readers than to others.

Practitioners will find that the book has been constructed to be readable more or less sequentially from start to finish. The first part of the book may well cover familiar material for many practitioners, but it approaches that material from the perspective of the reliability and consistency issues that arise even when the standard distributed system technologies are used. We also look at the important roles of performance and modularity in building distributed software that can be relied upon. The second part of the book, which focuses on the Web, is of a similar character: Even experts in this area may be surprised by some of the subtle reliability and consistency issues associated with the Web, and they may find the suggested solutions useful in their work.

The third part of the book looks squarely at reliability technologies. Here, a pragmatically oriented reader may want to skim through Chapters 13 through 16, which cover the details of some fairly complex protocols and programming models. This material is included for thoroughness, and I don't think it is exceptionally hard to understand. However, the developer of a reliable system doesn't necessarily need to know every detail of how the underlying protocols work, or how they are positioned relative to some of the theoretical arguments of the decade. The remainder of the book can be read without having read through these chapters in any great detail. Chapters 17 and 18 look at the use of tools through an approach based on wrappers, and Chapters 19 through 24 look at some related issues concerning topics such as real-time systems, security, persistent data, and system management. The content is practical and the material is intended to be of a hands-on nature. Thus, the book is designed to be read more or less in order by system developers, with the exception of those parts of Chapters 13 through 16 where the going gets a bit heavy.

Where possible, the book includes general background material: There is a section on ATM networks, for example, that could be read independently of the rest of the book, one on CORBA, one on message-oriented middleware, and so forth. As much as practical, I have tried to make these sections freestanding and to index them properly, so that if one were worried about security exposures of the NFS file system, for example, it would be easy to read about that specific topic without reading the entire book as well. Hopefully, practitioners will find this book useful as a general reference for the technologies covered, and not purely for its recommendations in the area of security and reliability.

Next, here are some comments directed toward other researchers and instructors who may read or choose to teach from this book. I based the original outline of this book on a course that I have taught several times at Cornell, to a mixture of fourth-year undergraduates, professional master's degree students, and first-year Ph.D. students. To facilitate the development of course materials, I have placed my slides (created using the Microsoft PowerPoint utility) on Cornell University's public file server, where they can be retrieved using FTP. (Copy the files from ftp.cs.cornell.edu/pub/ken/slides.) The book also includes a set of problems that can be viewed either as thought-provoking exercises for the professional who wishes to test his or her own understanding of the material, or as the basis for possible homework and course projects in a classroom setting.
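
For readers who prefer to script the retrieval, the following is a minimal sketch using Python's standard ftplib module; it assumes the server accepts anonymous logins and that the slides sit directly in the directory named above.

```python
# Minimal sketch: retrieve the course slides over anonymous FTP.
# Assumes the server accepts anonymous logins and that the files sit
# directly in pub/ken/slides; adjust host, path, or login as needed.
import os
from ftplib import FTP

def fetch_slides(host="ftp.cs.cornell.edu", path="pub/ken/slides", dest="slides"):
    os.makedirs(dest, exist_ok=True)
    with FTP(host) as ftp:
        ftp.login()                              # anonymous login
        ftp.cwd(path)                            # move to the slides directory
        for name in ftp.nlst():                  # list the available files
            with open(os.path.join(dest, name), "wb") as out:
                ftp.retrbinary("RETR " + name, out.write)

if __name__ == "__main__":
    fetch_slides()
```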

Any course based on this book should adopt the same practical perspective as the book itself. I suspect that some of my research colleagues will consider the treatment broad but somewhat superficial; this reflects a decision to focus primarily on system issues, rather than on theory or exhaustive detail on any particular topic. In making this decision, compromises had to be accepted: When teaching from this book, it may be necessary to ask the students to read some of the more theoretically rigorous books, which are cited in subsections of interest to the instructor, and to look in greater detail at some of the systems that are mentioned only briefly here. On the positive side, however, there are few, if any, introductory distributed system books that try to provide a genuinely broad perspective on issues in reliability. In my experience, many students are interested in this kind of material today, and, having gained a general exposure to it, would then be motivated to attend a much more theoretical course focused on fundamental issues in distributed systems theory. Thus, while this book may not be sufficient in and of itself for launching a research effort in distributed computing, it could well serve as a foundation for such an activity.

It should also be noted that, in my own experience, the book is too long for a typical 12-week semester. Instructors who elect to teach from it should be selective about the material that will be covered, particularly if they intend to treat Chapters 13 through 17 in any detail. If one has the option of teaching over two semesters, it might make sense to split the course into two parts and to include supplemental material on the Web. I suspect that such a sequence would be very popular given the current interest in network technology. At Cornell, for example, I tend to split this material into a more practical course that I teach in the fall, aiming at our professional master's degree students, followed by a more probing advanced graduate course that I or one of my colleagues teaches in the spring, drawing primarily on the original research papers associated with the topics we cover. This works well for us at Cornell, and the organization and focus of the book match with such a sequence.

A final comment regarding references: To avoid encumbering the discussion with a high density of citations, the book cites relevant work the first time it is mentioned in the text, or where the discussion needs to point to a specific reference, but may not do so in subsequent references to the same work. Full citations can be found in the Bibliography, and references are also collected at the end of each chapter into a short section on related reading. It is hard to do adequate justice to such a large and dynamic area of research with a limited number of citations, but every effort has been made to be fair and complete.

Introduction

Despite nearly 20 years of progress toward ubiquitous computer connectivity, distributed computing systems have only recently emerged to play a serious role in industry and society. Perhaps this explains why so few distributed systems are reliable in the sense of tolerating failures automatically, guaranteeing properties such as performance or response time, or offering security against intentional threats. In many ways the engineering discipline of reliable distributed computing is still in its infancy.

One might be tempted by a form of circular reasoning, concluding that reliability must not be all that important in distributed systems (otherwise, the pressure to make such systems reliable would long since have become overwhelming). Yet, it seems more likely that we have only recently begun to see the types of distributed computing systems in which reliability is critical. To the extent that existing mission- and even life-critical applications rely upon distributed software, the importance of reliability has perhaps been viewed as a narrow, domain-specific issue. On the other hand, as distributed software is placed into more and more critical applications, where safety or financial stability of large organizations depends upon the reliable operation of complex distributed applications, the inevitable result will be a growing demand for technology developers to demonstrate the reliability of their distributed architectures and solutions. It is time to tackle distributed system reliability in a serious manner. To fail to do so today is to invite catastrophic computer-system failures tomorrow.

At the time of this writing, the sudden emergence of the World Wide Web (variously called the Web, the Information Superhighway, the Global Information Infrastructure, the Internet, or just the Net) is bringing this issue to the forefront. In many respects, reliability in distributed systems is today tied to the future of the Web and the technology base that has been used to develop it. It is unlikely that any reader of this text is not familiar with the Web technology base, which has penetrated the computing industry in record time. A basic premise of our study is that the Web will be a driver for distributed computing, by creating a mass market around distributed computing. However, the term "Web" is often used loosely: Much of the public sees the Web as a single entity that encompasses all the Internet technologies that exist today and that may be introduced in the future. Thus, when we talk about the Web, we are inevitably faced with a much broader family of communication technologies.

It is clear that some form of critical mass has recently been reached: Distributed computing is emerging from its specialized and very limited niche to become a mass-market commodity--something that literally everyone depends on, such as a telephone or an automobile. The Web paradigm brings together the key attributes of this new market in a single package: easily understandable graphical displays, substantial content, unlimited information to draw upon, and virtual worlds in which to wander and work. But the Web is also stimulating growth in other types of distributed applications. In some intangible way, the experience of the Web has caused modern society to suddenly notice the potential of distributed computing.

Consider the implications of a societal transition whereby distributed computing has suddenly become a mass-market commodity. In the past, a mass-market item was something everyone "owned." With the Web, one suddenly sees a type of commodity that everyone "does." For the most part, the computers and networks were already in place. What has changed is the way people see them and use them. The paradigm of the Web is to connect useful things (and many useless things) to the network. Communication and connectivity suddenly seem to be mandatory: No company can possibly risk arriving late for the Information Revolution. Increasingly, it makes sense to believe that if an application can be put on the network, someone is thinking about doing so, and soon.

Whereas reliability and indeed distributed computing were slow to emerge prior to the introduction of the Web, reliable distributed computing will be necessary if networked solutions are to be used safely for many of the applications that are envisioned. In the past, researchers in the field wondered why the uptake of distributed computing had been so slow. Overnight, the question has become one of understanding how the types of computing systems that run on the Internet and the Web, or that will be accessed through them, can be made reliable enough for emerging critical uses.

If Web-like interfaces present medical status information and records to a doctor in a hospital, or are used to control a power plant from a remote console, or to guide the decision making of major corporations, reliability of those interfaces and applications will be absolutely critical to the users. Some may have life-or-death implications: If a physician bases a split-second decision on invalid data, the patient might die. Other interfaces may be critical to the efficient function of the organization that uses them: If a bank mismanages risk because of an inaccurate picture of how its investments are allocated, the bank could incur huge losses or even fail. In still other settings, reliability may emerge as a key determinant in the marketplace: The more-reliable product, at a comparable price, may simply displace the less-reliable one. Reliable distributed computing suddenly has broad relevance.

Throughout this book, the term "distributed computing" is used to describe a type of computer system that differs from what could be called a "network computing" system. The distinction illuminates the basic issues with which we will be concerned.

As we use the term here, a computer network is a communication technology supporting the exchange of messages among computer programs executing on computational nodes. Computer networks are data movers, providing capabilities for sending data from one location to another, dealing with mobility and changing topology, and automating the division of available bandwidth among contending users. Computer networks have evolved over a 20-year period, and during the mid-1990s network connectivity between computer systems became pervasive. Network bandwidth has also increased enormously, rising from hundreds of bytes per second in the early 1980s to millions of bytes per second in the mid-1990s, with gigabyte rates anticipated in the late 1990s and beyond.

Network functionality evolved steadily during this period. Early use of networks was entirely for file transfer, remote login, and electronic mail or news. Over time, however, the expectations of users and the tools available have changed. The network user in 1996 is likely to be familiar with interactive network browsing tools, such as Netscape's browser, which permit the user to wander within a huge and interconnected network of multimedia information and documents. Tools such as these permit the user to conceive of a computer workstation as a window into an immense world of information, accessible using a great variety of search tools, easy to display and print, and linked to other relevant material that may be physically stored halfway around the world and yet accessible at the click of a mouse.

Meanwhile, new types of networking hardware have emerged. The first generation of networks was built using point-to-point connections: To present the illusion of full connectivity to users, the network included a software layer for routing and connection management. Over time, these initial technologies were largely replaced by high-speed, long-distance lines that route through various hubs, coupled to local area networks implemented using multiple access technologies such as Ethernet and FDDI: hardware in which a single "wire" has a large number of computers attached to it, supporting the abstraction of a shared message bus. At the time of this writing, a third generation of technologies is reaching the market: ATM hardware capable of supporting gigabit communication rates over virtual circuits, mobile connection technologies for the office that will allow computers to be moved without rewiring, and more ambitious mobile computing devices that exploit the nationwide cellular telephone grid for communications support.

As recently as the early 1990s, computer bandwidth over wide area links was limited for most users. The average workstation had high-speed access to a local network, and perhaps the local e-mail system was connected to the Internet, but individual users (especially those working from PCs) rarely had better than 1,600-baud connections available for personal use of the Internet. This picture is changing rapidly today: More and more users have high-speed modem connections to an Internet service provider that offers megabyte-per-second connectivity to remote servers. With the emergence of ISDN services to the home, the last link of the chain will suddenly catch up with the rest. Individual connectivity has thus jumped from 1,600 baud to perhaps 28,800 baud at the time of this writing, and may jump to 1 Mbaud or more in the not-too-distant future. Moreover, this bandwidth has finally reached the PC community, which enormously outnumbers the workstation community.

It has been suggested that technology revolutions are often spurred by discontinuous, as opposed to evolutionary, improvement in a key aspect of a technology. The bandwidth improvements we are now experiencing are so disproportionate with respect to other performance changes (memory sizes, processor speeds) as to fall squarely into the discontinuous end of the spectrum. The sudden connectivity available to PC users is similarly disproportionate to anything in prior experience. The Web is perhaps just the first of a new generation of communication-oriented technologies enabled by these sudden developments.

In particular, the key enablers for the Web were precisely the availability of adequate long-distance communication bandwidth to sustain its programming model, coupled to the evolution of computing systems supporting high-performance graphical displays and sophisticated local applications dedicated to the user. It is only recently that these pieces fell into place. Indeed, the Web emerged as early as it could possibly have done, considering the state of the art in the various technologies on which it depends. Thus, while the Web is clearly a breakthrough--the "killer application" of the Internet--it is also the most visible manifestation of a variety of underlying developments that are also enabling other kinds of distributed applications. It makes sense to see the Web as the tip of an iceberg: a paradigm for something much broader that is sweeping the entire computing community.

As the trend toward better communication performance and lower latencies continues, it is certain to fuel continued growth in distributed computing. In contrast to a computer network, a distributed computing system refers to computing systems and applications that cooperate to coordinate actions at multiple locations in a network. Rather than adopting a perspective in which conventional (nondistributed) application programs access data remotely over a network, a distributed system includes multiple application programs that communicate over the network, but that take action at the multiple locations where the applications run. Despite the widespread availability of networking since the early 1980s, distributed computing has only become common in the 1990s. This lag reflects a fundamental issue: Distributed computing turns out to be much harder than nondistributed computing or simple network applications, especially if reliability is a critical requirement.

Our treatment explores the technology of distributed computing with a particular bias: to understand why the emerging generation of critical Internet and Web technologies is likely to require very high levels of reliability, and to explore the implications of this for distributed computing technologies. A key issue is to gain some insight into the factors that make it so hard to develop distributed computing systems that can be relied upon in critical settings, and to understand what can be done to simplify the task. In other disciplines, such as civil engineering or electrical engineering, a substantial body of practical development rules exists upon which the designer of a complex system can draw to simplify his or her task. It is rarely necessary for the company that builds a bridge to engage in theoretical analyses of stress or basic properties of the materials used, because the theory in these areas has already been reduced to collections of practical rules and formulas that the practitioner can treat as tools for solving practical problems.

This observation motivated the choice of the cover of the book. The Golden Gate Bridge is a marvel of civil engineering that reflects a very sophisticated understanding of the science of bridge building. Although located in a seismically active area, the bridge is believed capable of withstanding even an extremely severe earthquake. It is routinely exposed to violent winter storms: It may sway but is never seriously threatened. And yet the bridge is also esthetically pleasing: It is one of the truly beautiful constructions of its era. Watching the sun set over the bridge from Berkeley, where I attended graduate school, remains among the most memorable experiences of my life. The bridge illustrates that beauty can also be resilient: a fortunate development, since, otherwise, the failure of the Tacoma Narrows Bridge might have ushered in a generation of bulky and overengineered bridges. The achievement of the Golden Gate Bridge illustrates that even when engineers are confronted with extremely demanding standards, it is possible to achieve solutions that are elegant and lovely at the same time as they are resilient. This is only possible, however, to the degree that there exists an engineering science of robust bridge building.

We can build distributed computing systems that are reliable in this sense, too. Such systems would be secure, trustworthy, and would guarantee availability and consistency even when limited numbers of failures occur. Hopefully, these limits can be selected to provide adequate reliability without excessive cost. In this manner, just as the science of bridge building has yielded elegant and robust bridges, reliability need not compromise elegance and performance in distributed computing.

One could argue that in distributed computing, we are today building the software bridges of the Information Superhighway. Yet, in contrast to the disciplined engineering that enabled the Golden Gate Bridge, as one explores the underlying technology of the Internet and the Web, one discovers a disturbing and pervasive inattention to issues of reliability. It is common to read that the Internet (developed originally by the Defense Department's Advanced Research Projects Agency, DARPA) was built to withstand a nuclear war. Today, we need to adopt a similar mindset as we extend these networks into systems that must support tens or hundreds of millions of Web users, as well as a growing number of hackers whose objectives vary from the annoying to the criminal. We will see that many of the fundamental technologies of the Internet and the Web, although completely reasonable in the early days of the Internet's development, have now started to limit scalability and reliability, and the infrastructure is consequently exhibiting troubling signs of stress.

One of the major challenges, of course, is that use of the Internet has begun to expand so rapidly that the researchers most actively involved in extending its protocols and enhancing its capabilities are forced to work incrementally: Only limited changes to the technology base can be contemplated, and even small upgrades can have very complex implications. Moreover, upgrading the technologies used in the Internet is somewhat like changing the engines on an airplane while it is flying. Jointly, these issues limit the ability of the Internet community to move to a more reliable, secure, and scalable architecture. They create a background against which the goals of this book will not easily be achieved.

In early 1995, I was invited by DARPA to participate in an unclassified study concerning the survivability of distributed systems. Participants included academic experts and experts familiar with the state of the art in such areas as telecommunications, power system management, and banking. This study was undertaken against a backdrop colored by the recent difficulties of the Federal Aviation Administration (FAA), which launched a project in the late 1980s and early 1990s to develop a new generation of highly reliable distributed air traffic control software. Late in 1994, after losing a huge sum of money and essentially eliminating all distributed aspects of an architecture that was originally innovative precisely for its distributed reliability features, a prototype of the proposed new system was finally delivered, but with such limited functionality that planning of yet another new generation of software had to begin immediately. Meanwhile, article after article in the national press reported on failures of air traffic control systems, many stemming from software problems and several exposing airplanes and passengers to extremely dangerous conditions. Such a situation can only inspire the utmost concern in regard to the practical state of the art.

Although our study did not focus on the FAA's specific experience, the areas we did study are in many ways equally critical. What we learned is that situations encountered by the FAA's highly visible project are occurring, to a greater or lesser degree, within all of these domains. The pattern is one in which pressure to innovate and introduce new forms of products leads to the increasingly ambitious use of distributed computing systems. These new systems rapidly become critical to the enterprise that developed them: Too many interlocked decisions must be made to permit such steps to be reversed. Responding to the pressures of timetables and the need to demonstrate new functionality, engineers inevitably postpone considerations of availability, security, consistency, system management, and fault tolerance--what we call "reliability" in this text--until late in the game, only to find that it is then very hard to retrofit the necessary technologies into what has become an enormously complex system. Yet, when pressed on these issues, many engineers respond that they are merely following common practice: that their systems use the best generally accepted engineering practice and are neither more nor less robust than the other technologies used in the same settings.

Our group was very knowledgeable about the state of the art in research on reliability. So, we often asked our experts whether the development teams in their area were aware of one result or another in the field. What we learned was that research on reliability has often stopped too early to impact the intended consumers of the technologies we developed. It is common for work on reliability to stop after a paper or two and perhaps a splashy demonstration of how a technology can work. But such a proof of concept often leaves open the question of how the reliability technology can interoperate with the software development tools and environments that have become common in industry. This represents a serious obstacle to the ultimate use of the technique, because commercial software developers necessarily work with commercial development products and seek to conform to industry standards.

This creates a quandary: One cannot expect a researcher to build a better version of a modern operating system or communications architecture--such tasks are enormous and even very large companies have difficulty successfully concluding them. So it is hardly surprising that research results are demonstrated on a small scale. Thus, if industry is not eager to exploit the best ideas in an area such as reliability, there is no organization capable of accomplishing the necessary technology transition.

For example, we will look at an object-oriented technology called the Common Object Request Broker Architecture, or CORBA, which has become extremely popular. CORBA is a structural methodology: a set of rules for designing and building distributed systems so that they will be explicitly described and easily managed, and so that components can be interconnected as easily as possible. One would expect that researchers on security, fault tolerance, consistency, and other properties would embrace such architectures, because they are highly regular and designed to be extensible: Adding a reliability property to a CORBA application should be a very natural step. However, relatively few researchers have looked at the specific issues that arise in adapting their results to a CORBA setting (we'll hear about some of the ones that have). Meanwhile, the CORBA community has placed early emphasis on performance and interoperability, while reliability issues have been dealt with primarily by individual vendors (although, again, we'll hear about some products that represent exceptions to the rule). What is troubling is the sense of disconnection between the reliability community and its most likely users, and the implication that reliability is not accorded a very high value by the vendors of distributed system products today.

Our study contributed toward a decision by the Department of Defense (DoD) to expand its investment in research on technologies for building practical, survivable, distributed systems. This DoD effort will focus both on developing new technologies for implementing survivable systems, and on developing new approaches to hardening systems built using conventional distributed programming methodologies, and it could make a big difference. But one can also use the perspective gained through a study such as this one to look back over the existing state of the art, asking to what degree the technologies we already have in hand can, in fact, be applied to the critical computing systems that are already being developed.

As it happened, I started work on this book during the period when this DoD study was underway, and the presentation that follows is strongly colored by the perspective that emerged from it. Indeed, the study has considerably impacted my own research project. I've come to the personal conclusion that the situation could be much better if developers were simply to begin to think hard about reliability and had greater familiarity with the techniques at their disposal today. There may not be any magic formulas that will effortlessly confer reliability upon a distributed system, but, at the same time, the technologies available to us are in many cases very powerful and are frequently much more relevant to even off-the-shelf solutions than is generally recognized. We need more research on the issue, but we also need to try harder to incorporate what we already know how to do into the software development tools and environments on which the majority of distributed computing applications are now based. This said, it is also clear that researchers will need to start paying more attention to the issues that arise in moving their ideas from the laboratory to the field.

Lest these comments seem to suggest that the solution is in hand, it must be understood that there are intangible obstacles to reliability that seem very subtle and yet rather pervasive. Earlier, it was mentioned that the Internet and the Web are in some ways fundamentally unreliable and that industry routinely treats reliability as a secondary consideration, to be addressed only in mature products and primarily in a fire-fighting mode--for example, after a popular technology is somehow compromised by hackers in a visible way. Neither of these problems will be easy to fix, and they combine to have far-reaching implications. Major standards have repeatedly deferred consideration of reliability issues and security until future releases of the standards documents or prototype platforms. The message sent to developers is clear: Should they wish to build a reliable distributed system, they will need to overcome tremendous obstacles, both internal to their companies and in the search for enabling technologies, and they will find relatively little support from the vendors that sell standard computing platforms.

The picture is not uniformly grim, of course. The company I founded in 1988, Isis Distributed Systems, is one of a handful of small technology sources that do offer reliability solutions, often capable of being introduced very transparently into existing applications. (Isis now operates as a division of Stratus Computer, Inc., and my own role is limited to occasional consulting.) However, the big story is that reliability has yet to make much of a dent in the distributed computing market.

The approach of this book is to treat distributed computing technology in a uniform way, looking at the technologies used in developing Internet and Web applications, at emerging standards such as CORBA, and at the technologies available to us for building reliable solutions within these settings. Many books that set this goal would do so primarily through a treatment of the underlying theory, but our approach here is much more pragmatic. By and large, we treat the theory as a source of background information that one should be aware of, but not as the major objective. Our focus, rather, is to understand how and why practical software tools for reliable distributed programming work, and to understand how they can be brought to bear on the broad area of technology currently identified with the Internet and the Web. By building up models of how distributed systems execute, and by using these models to prove properties of distributed communication protocols, we will show how computing systems of this sort can be formalized and reasoned about; however, the treatment is consistently driven by the practical implications of our results.

One of the most serious concerns about building reliable distributed systems stems from more basic issues that underlie any form of software reliability. Through decades of experience, it has become clear that software reliability is a process, not a property. One can talk about design practices that reduce errors, protocols that reconfigure systems to exclude faulty components, testing and quality-assurance methods that lead to increased confidence in the correctness of software, and basic design techniques that tend to limit the impact of failures and prevent them from propagating. All of these improve the reliability of a software system, and presumably would also increase the reliability of a distributed software system. Unfortunately, however, no degree of process ever leads to more than empirical confidence in the reliability of a software system. Thus, even in the case of a nondistributed system, it is hard to say "system X guarantees reliability property Y" in a rigorous manner. This same limitation extends to distributed settings, but is made even worse by the lack of a process comparable to the one used in conventional systems. Significant advances are needed in the process of developing reliable distributed computing systems, in the metrics by which we characterize reliability, the models we use to predict their behavior in new configurations reflecting changing loads or failures, and in the formal methods used to establish that a system satisfies its reliability goals.

For certain types of applications, this creates a profound quandary. Consider the design of an air traffic control software system, which (among other services) provides air traffic controllers with information about the status of air traffic sectors (Figure 1). Web sophisticates may want to think of this system as one that provides a Web-like interface to a database of routing information maintained on a server. Thus, the controller would be presented with a depiction of the air traffic situation, with pushbutton-style interfaces or other case-specific interfaces providing access to additional information about flights, projected trajectories, possible options for rerouting a flight, and so forth. To the air traffic controller these are the commands supported by the system; the Web user might think of them as active hyperlinks. Indeed, even if air traffic control systems were not typical of what the Web is likely to support, other equally critical applications are already moving to the Web, using very much the same programming model.

Figure 1. An idealized client/server system with a backup server for increased availability. The clients interact with the primary server; in an air traffic application, the server might provide information on the status of air traffic sectors, and the clients may be air traffic controllers responsible for routing decisions. The primary server keeps the backup up to date, so that if a failure occurs, the clients can switch to the backup and resume operation with minimal disruption.
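
To make the structure of Figure 1 a bit more concrete, here is a minimal sketch of the primary/backup arrangement. The class and method names (SectorServer, assign, and so on) are illustrative assumptions rather than anything defined in the book, and the "servers" are ordinary in-process objects; the point is simply that the primary applies each update locally and forwards it to the backup before answering the client, so the backup holds an up-to-date copy of the sector table if it must take over.

```python
# Illustrative sketch of the Figure 1 arrangement (not a protocol from the book):
# the primary applies each update, mirrors it to the backup, then replies.

class SectorServer:
    def __init__(self, name, backup=None):
        self.name = name
        self.backup = backup            # set only on the primary
        self.sectors = {}               # sector id -> controller currently assigned

    def assign(self, sector, controller):
        """Grant the sector to the controller unless another controller holds it."""
        holder = self.sectors.get(sector)
        if holder is not None and holder != controller:
            return False                # already granted to someone else
        self.sectors[sector] = controller
        if self.backup is not None:
            self.backup.assign(sector, controller)   # keep the backup up to date
        return True

class ControllerClient:
    def __init__(self, primary, backup):
        self.servers = [primary, backup]     # try the primary first, then the backup

    def request_sector(self, sector, controller):
        for server in self.servers:
            try:
                return server.assign(sector, controller)
            except ConnectionError:          # "failure" detected: switch to the backup
                continue
        raise RuntimeError("no server reachable")

# Usage: every assignment made through the primary is mirrored on the backup.
backup = SectorServer("backup")
primary = SectorServer("primary", backup=backup)
client_a = ControllerClient(primary, backup)
client_b = ControllerClient(primary, backup)
print(client_a.request_sector("sector-17", "controller A"))   # True: sector granted
print(client_b.request_sector("sector-17", "controller B"))   # False: already occupied
print(backup.sectors)                                          # mirrors the primary's state
```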

An air traffic controller who depends upon a system such as this needs an absolute assurance that if the service reports that a sector is available and a plane can be routed into it, this information is correct and no other controller has been given the same information in regard to routing some other plane. An optimization criterion for such a service would be that it minimizes the frequency with which it reports a sector as being occupied when it is actually free. A fault-tolerance goal would be that the service remains operational despite limited numbers of failures of component programs, and perhaps that it performs self-checking operations so as to take a component off-line if it somehow falls out of synchronization with regard to the states of other components. Such goals would avoid scenarios such as the one illustrated in Figure 2, where the system state has become dangerously inconsistent as a result of a network failure that fools some clients into thinking the primary server has failed, and similarly fools the primary and backup servers into mutually believing one another to have crashed.

Now, suppose that the techniques of this book were used to construct such a service, using the best available technological solutions, combined with rigorous formal specifications of the software components involved and the best possible quality process. Theoretical results assure us that inconsistencies such as the one in Figure 2 cannot occur. Years of testing might yield a very high degree of confidence in the system, yet the service remains a large, complex software artifact. Even minor changes to the system, such as adding a feature, correcting a very simple bug, or upgrading the operating system version or hardware, could introduce serious problems long after the system was put into production. The question then becomes: Can complex software systems ever be used in critical settings? If so, are distributed systems somehow worse, or are the issues similar?

Figure 2. This figure represents a scenario that we will revisit in Chapter 4, when we consider the use of a standard remote procedure call methodology to build a client/server architecture for a critical setting. In the case illustrated, some of the client programs have become disconnected from the primary server, perhaps because of a transient network failure (one that corrects itself after a brief period during which message loss rates are very high). In the resulting system configuration, the primary and backup servers each consider themselves to be in charge of the system as a whole. There are two clients still connected to the primary server (black), one to the backup server (white), and one client is completely disconnected (gray). Such a configuration exposes the application user to serious threats. In an air traffic control situation, it is easy to imagine that accidents could occur if such a situation were encountered. The goal of this book is twofold: to assist the reader in understanding why such situations are a genuine threat in modern computing systems and to study the technical options for building better systems that can prevent such situations from occurring. The techniques presented will sometimes have limitations, which we will attempt to quantify and whose reliability implications we will attempt to understand. While many modern distributed systems have overlooked reliability issues, our working hypothesis will be that this situation is changing rapidly and that the developer of a distributed system has no choice but to confront these issues and begin to use technologies that respond to them.
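
The danger sketched in Figure 2 comes from equating "unreachable" with "crashed." The fragment below is a deliberately naive model (the names and the partition mechanics are assumptions made for illustration, not a protocol from the book): each server applies the rule "if I cannot reach my peer, it must have failed, so I am in charge," and a transient partition is enough to leave both the primary and the backup convinced that they alone control the system.

```python
# Deliberately naive model of the split-brain scenario in Figure 2.
# Each server treats an unreachable peer as a crashed peer and takes over.

class Server:
    def __init__(self, name, in_charge=False):
        self.name = name
        self.in_charge = in_charge
        self.reachable = set()          # names of nodes this server can talk to

    def can_reach(self, peer):
        return peer.name in self.reachable

    def check_peer(self, peer):
        # Naive failure detection: no response within the timeout is taken
        # to mean the peer has crashed, so this server assumes sole authority.
        if not self.can_reach(peer):
            self.in_charge = True

def connect(a, b):
    a.reachable.add(b.name)
    b.reachable.add(a.name)

def partition(a, b):
    a.reachable.discard(b.name)
    b.reachable.discard(a.name)

primary = Server("primary", in_charge=True)
backup = Server("backup")
connect(primary, backup)

# A transient network failure separates the two servers.
partition(primary, backup)
primary.check_peer(backup)     # primary: "backup crashed" -> remains in charge
backup.check_peer(primary)     # backup:  "primary crashed" -> takes over

# Both servers now believe they are in charge: the inconsistency of Figure 2.
print(primary.in_charge, backup.in_charge)   # True True
```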

This question lies at the core of the material in this book. There may not be a single answer: Distributed systems are suitable for some critical applications and are not suited for others. In effect, although one can build reliable distributed software, reliability has its limits, and there are problems that distributed software should probably not be used to solve. Even given an appropriate technology, it is easy to build inappropriate solutions--and, conversely, with an inadequate technology, one can sometimes build critical services that are still useful in limited ways. The air traffic example, described previously, might or might not fall into the feasible category, depending on the detailed specification of the system, the techniques used to implement the solution, and the overall process by which the result is used and maintained.

Through the material in this book, the developer will be guided to appropriate design decisions, appropriate development methodologies, and to an understanding of the reliability limits on the solutions that result from this process. No book can expect to instill the sense of responsibility that the reader may need to draw upon in order to make such decisions wisely, but one hopes that computer system engineers, like bridge builders and designers of aircraft, are highly motivated to build the best and most reliable systems possible. Given such a motivation, an appropriate development methodology, and appropriate software tools, extremely reliable distributed software can be implemented and deployed even into critical settings. We will see precisely how this can be done in the following chapters.

Perhaps this book can serve a second purpose while accomplishing its primary one. Many highly placed industry leaders have commented to me that until reliability is forced upon them, their companies will never take the issues involved seriously. The investment needed is simply viewed as very large and likely to slow the frantic rate of progress on which computing as an industry has come to depend. I believe that the tide is now turning in a way that will, in fact, force change, and that this book can contribute to what will, over time, become an overwhelming priority for the industry.

Reliability is viewed as complex and costly, much as the phrase "robust bridge" conjures up a vision of a massive, expensive, and ugly artifact. Yet, the Golden Gate Bridge is robust and is anything but massive or ugly. To overcome this instinctive reaction, it will be necessary for the industry to come to understand reliability as being compatible with performance, elegance, and market success. At the same time, it will be important for pressure favoring reliability to grow, through demand by consumers for more reliable products. Together, such trends would create an incentive for reliable distributed software engineering.

As the general level of demonstrated knowledge concerning how to make systems reliable rises, the expectation of society and government that vendors will employ such technologies is also likely to rise. It will become harder and harder for corporations to cut corners by bringing an unreliable product to market and yet advertise it as "fault tolerant," "secure," or otherwise "reliable." Today, these terms are often used in advertising for products that are not reliable in any meaningful sense at all. One might similarly claim that a building or a bridge was constructed "above code" in a setting where the building code is completely ad hoc. The situation changes considerably when the building code is made more explicit and demanding, and bridges and buildings that satisfy the standard have actually been built successfully (and, perhaps, elegantly and without excessive added cost). In the first instance, a company can easily cut corners; in the second, the risks of doing so are greatly increased.

Moreover, at the time of this writing, vendors often seek to avoid software product liability by using complex contracts that stipulate the unsuitability of their products for critical uses, the near certainty that their products will fail even if used correctly, and in which it is stressed that the customer accepts full responsibility for the eventual use of the technology. It seems likely that as such contracts are put to the test, many of them will be recognized as analogous to those used by a landlord who rents a dangerously deteriorated apartment to a tenant, using a contract that warns of the possibility that the kitchen floor could collapse without warning and that the building is a firetrap lacking adequate escape routes. A landlord could certainly draft such a contract and a tenant might well sign it. But if the landlord fails to maintain the building according to the general standards for a safe and secure dwelling, the courts would still find the landlord liable if the floor indeed collapses. One cannot easily escape the generally accepted standards for one's domain of commercial activity.

By way of analogy, we may see growing pressure on vendors to recognize their fundamental responsibilities to provide a technology base adequate to the actual uses of their technologies, like it or not. Meanwhile, today a company that takes steps to provide reliability worries that in so doing, it may have raised expectations impossibly high and hence exposed itself to litigation if its products fail. As reliability becomes more and more common, such a company will be protected by having used the best available engineering practices to build the most reliable product it was capable of producing. If such a technology does fail, one at least knows that it was not the consequence of some outrageous form of negligence. Viewed in these terms, many of the products on the market today are seriously deficient. Rather than believing it safer to confront a reliability issue using the best practices available, many companies feel that they run a lower risk by ignoring the issue and drafting evasive contracts that hold themselves harmless in the event of accidents.

The challenge of reliability in distributed computing is perhaps the unavoidable challenge of the coming decade, just as performance was the challenge of the past decade. By accepting this challenge, we also gain new opportunities, new commercial markets, and help create a future in which technology is used responsibly for the broad benefit of society. There will inevitably be real limits on the reliability of the distributed systems we can build, and consequently there will be types of distributed computing systems that should not be built because we cannot expect to make them adequately reliable. However, we are far from those limits; in many circumstances we are deploying technologies known to be fragile in ways that actively encourage their use in critical settings. Ignoring this issue, as occurs too often today, is irresponsible and dangerous and increasingly unacceptable. Reliability challenges us as a community: It now falls upon us to respond.