Programmer's Life: Fault Tolerance

Fault Tolerance

Fault tolerance is the intrinsic ability of a software system to continuously deliver service to its users in the presence of faults. This approach to software reliability addresses how to keep a system functioning after the faults in the delivered system manifest themselves. The implementation of software fault tolerance is dramatically different from that of hardware. In a hardware fault-tolerant system, a second or third complete set of hardware runs in parallel, shadowing the execution of the main processor. All of the mass-storage and mass-memory devices are mirrored so that if one fails, another immediately picks up the application. This addresses the faults shown in the bathtub curve: hardware wearing out.

Trying to address software fault tolerance in the same fashion, by running the same software in parallel on a different processor, only results in the second copy of the exact same software failing a millisecond after the first copy. Simply running a separate copy of the application does nothing for software fault tolerance.
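The point can be illustrated with a small sketch. The routine `parse_rate` and its defect are hypothetical, invented only for this illustration, and the sketch assumes a deterministic design fault, which is the case described above.

```python
def parse_rate(text: str) -> float:
    """Hypothetical service routine with a latent design fault:
    it divides by the parsed interval without checking for zero."""
    amount, interval = text.split("/")
    return float(amount) / float(interval)


def run_replicated(text: str, copies: int = 2) -> list[str]:
    """Run identical copies of the same software on the same input
    and record the outcome of each copy."""
    outcomes = []
    for _ in range(copies):
        try:
            parse_rate(text)
            outcomes.append("ok")
        except ZeroDivisionError:
            outcomes.append("failed")
    return outcomes


print(run_replicated("10/2"))  # every copy succeeds
print(run_replicated("10/0"))  # every copy hits the same fault
```

Because each copy contains the same fault, the input that brings down the first copy brings down every other copy as well. Adding copies adds no tolerance.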

From the middle phases of the software development life cycle through product delivery and maintenance, reliability efforts focus on fault tolerance. Table 20-6 shows the subprocesses of the major design, implementation, installation, delivery, and maintenance life cycle phases that support fault tolerance.

Fault tolerance begins at the implementation phase of product development and extends through installation, operations and support, and maintenance to final product retirement. As long as the software is running in production, the fault tolerance approach to reliability is useful.

Design, implementation, and installation were discussed under the fault removal approach to software reliability. Fault tolerance adds operations, support, and maintenance to complete the approach. Fault tolerance is a follow-on to fault removal, and all of the processes used in fault removal are used here. The difference is the focus on the product life cycle after installation. Post-release staff needs can be projected only with reference to historic information from previous products. The organization needs an available database of faults that were discovered in other products after installation. This database of faults, together with the effort taken to manage and repair them, is used to estimate how much post-release effort the just-released product will require. Using the historic metrics on faults discovered by development phase and the size of the new product relative to others, a quick estimate can be made of the faults remaining and the effort required to fix the new product.
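A minimal sketch of such a quick estimate follows. It assumes a simple proportional model in which post-release faults scale with product size and repair effort scales with fault count. The record fields, the function name, and all figures are hypothetical placeholders, not data from any real product.

```python
from dataclasses import dataclass


@dataclass
class ProductHistory:
    """One record from the historic fault database (hypothetical schema)."""
    size_kloc: float          # delivered size in thousands of lines of code
    post_release_faults: int  # faults discovered after installation
    repair_hours: float       # total effort spent managing and repairing them


def estimate_post_release(history: list[ProductHistory],
                          new_size_kloc: float) -> tuple[float, float]:
    """Return (estimated faults remaining, estimated repair hours)."""
    total_kloc = sum(p.size_kloc for p in history)
    total_faults = sum(p.post_release_faults for p in history)
    total_hours = sum(p.repair_hours for p in history)

    faults_per_kloc = total_faults / total_kloc   # historic fault density
    hours_per_fault = total_hours / total_faults  # historic repair cost

    est_faults = faults_per_kloc * new_size_kloc
    return est_faults, est_faults * hours_per_fault


# Illustrative numbers only.
history = [
    ProductHistory(size_kloc=50, post_release_faults=40, repair_hours=320),
    ProductHistory(size_kloc=150, post_release_faults=60, repair_hours=480),
]
faults, hours = estimate_post_release(history, new_size_kloc=100)
print(f"{faults:.0f} faults, {hours:.0f} hours")
```

With these placeholder records the historic density is 0.5 faults per KLOC at 8 hours per fault, so a 100 KLOC product is projected at 50 faults and 400 hours of post-release effort. A real estimate would also weigh differences in product type, team, and process.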

Table 20-6. Fault Tolerance Life Cycle Activities
Activity	Tolerance
Design and Implementation
Allocate reliability among components	X
Engineer to meet reliability objectives	X
Focus resources based on functional profile	X
Manage fault introduction and propagation	X
Measure reliability of acquired software	X
Installation
Determine operational profile	X
Conduct reliability growth testing	X
Track testing progress	X
Project additional testing needed	X
Certify reliability objectives are met	X
Operations, Support, and Maintenance
Project post-release staff needs	X
Monitor field reliability versus objectives	X
Track customer satisfaction with reliability	X
Time new feature introduction by monitoring reliability	X
Guide product and process improvement with reliability measures	X

To provide a set of data for future products, the project or product manager must monitor field reliability versus reliability objectives. This ties into tracking customer satisfaction with reliability. The end-user is the best source of reliability information about the software product. This is where reliability predictions meet the real world. End-users can neither be predicted nor directed. Therefore, estimates and assumptions made early in fault forecasting can only be validated through the fault tolerance approach to reliability.
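One way to make the comparison concrete is sketched below. It assumes field reliability is measured as failure intensity (failures per operating hour) and that the objective is stated in the same unit. The function names and figures are hypothetical.

```python
def field_failure_intensity(failures: int, operating_hours: float) -> float:
    """Observed failures per operating hour reported from the field."""
    if operating_hours <= 0:
        raise ValueError("operating_hours must be positive")
    return failures / operating_hours


def meets_objective(failures: int, operating_hours: float,
                    objective: float) -> bool:
    """True when the observed field intensity is at or below the objective."""
    return field_failure_intensity(failures, operating_hours) <= objective


# Hypothetical figures: 3 failures in 6,000 field hours, measured against
# an objective of 1 failure per 1,000 hours.
observed = field_failure_intensity(3, 6000)
print(observed, meets_objective(3, 6000, objective=1 / 1000))
```

Recording the observed intensity alongside the objective for each release gives the historic data set that later projects can draw on.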

The project or product manager must time new feature introduction by monitoring reliability. It is not a good idea to release new features to customers while known faults still reside in the software product. Combining the release of new feature sets with fault fixes is an appropriate practice for software product organizations. Guiding product and process improvement with reliability measures feeds the information gathered from customer experience back into product fault removal and continuous development process improvement. Reliability measures are expensive to institute, so their results must be captured and fed back into the learning organization.

Friday, November 13, 2009
