Error Handling Mechanisms for Real-Time Software

Faculty: D. Stewart
Students: Tom Carley, Jun Lang

Objective

Error handling is needed to ensure reliable and predictable operation of embedded systems in the presence of errors. Errors can result from hardware failures, software bugs, discrepancies between the external environment and the internal representation, or from improper timing and synchronization. Current error handling mechanisms, however, only address the needs of non-real-time systems, as highlighted in our recent survey. This project addresses error handling in real-time systems.

Motivation

Studies show that up to 70 percent of the code in reliable software systems can be for detecting and handling errors and other exceptional cases. The focus of developing real-time software, however, has been on the structure of initialization and normal operation code within processes. The first goal of this project is to address the shortcomings of existing methods, and create error handling mechanisms suitable for real-time systems, both in embedded environments and on symmetric multiprocessor (SMP) clusters.

There has been very little research in error handling for real-time systems, mainly because the problem is not well understood. Ironically, many of the most critical embedded systems have very little error handling. When we asked the engineers why not, they responded, "if an error occurs, we don't even know what to do with it, so what is the point of even detecting it?" Critical systems often resort to expensive replication of hardware to support fault tolerance, or they fail in a fatal manner, causing loss of human life and major loss of equipment. The second goal of this project is to gain an understanding of errors in embedded systems. We will implement embedded system applications in which we can intentionally create every error imaginable, yet still maintain a safe working environment and avoid costly damage to the equipment.

Approach

Most error handling mechanisms are difficult to use. The Chimera RTOS provides global error handling, which improves the error handling capabilities of C code in a manner similar to the C++ throw and catch mechanism, but at the operating system level. However, we have found that neither Chimera's global error handling nor the throw and catch mechanism in C++ is satisfactory for achieving high reliability in the presence of errors in real-time systems.

In our survey of exception handling mechanisms, we found that current mechanisms already offer approaches suitable for embedded applications for exception representation, handler binding, exception raising, handling action, information passing, handler scope, resource cleanup, exception interface, and reliability checks. None of the existing approaches to handler determination, criticality management, propagation of errors, and handler reuse, however, is sufficient for real-time systems, and only the mechanism we previously developed for Chimera can detect and handle timing errors. The following major issues need to be addressed in order to create a mechanism suitable for real-time software:

  • O(1) time-bounded handler determination: In existing mechanisms, the worst-case time to select the appropriate error handler in response to detection of an error is a function of the number of handlers installed in the system. Determination of the handler must be time-bounded if a real-time system is to remain predictable even in the presence of errors.
  • External handling of errors: Programming language mechanisms do not have the concept of processes, and therefore no concept of transferring errors to different processes. Current operating system-based mechanisms are aware of processes, but rely on the user to handle the error internally, then send a message or signal to a different process. An embedded software framework should support transparent external handling.
  • Criticality management: To prevent priority inversion, handlers for less critical errors must not preempt execution of high-priority tasks. Only Chimera's timing error detection and handling provides some priority control, and that mechanism does not work for non-timing-related errors.
  • Handler reuse: Just as it is desirable to reuse and reconfigure main-body code, which we achieve through port-based objects, there should exist the ability to reuse and reconfigure error handlers dynamically. The concept of error handler reconfiguration has not yet been addressed.

We have recently addressed the time-bounded handler determination issue; our design is described in [1]. Part of our research plan is to address the remaining issues. As a longer-range plan, we will integrate the mechanism with Chimera's timing error detection and handling mechanism.
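
To make the first issue above concrete, the following is a minimal sketch of one way to obtain constant-time handler determination: a table indexed directly by error code, so the lookup cost does not grow with the number of installed handlers. It is not the design from [1] and not the Chimera API; every name here (MAX_ERROR_CODES, install_handler, determine_handler, raise_error, record_handler) is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* All names below are hypothetical; this is not the Chimera API. */
#define MAX_ERROR_CODES 64

typedef int (*error_handler_t)(int code, void *info);

/* One slot per error code; NULL means "no handler installed". */
static error_handler_t handler_table[MAX_ERROR_CODES];

/* Fallback used when no handler is installed for a code. */
static int default_handler(int code, void *info) {
    (void)info;
    fprintf(stderr, "unhandled error %d\n", code);
    return -1;
}

/* Example handler that only records which error it saw. */
static int last_handled = -1;
static int record_handler(int code, void *info) {
    (void)info;
    last_handled = code;
    return 0;
}

/* Installation happens at configuration time, off the critical path. */
int install_handler(int code, error_handler_t h) {
    if (code < 0 || code >= MAX_ERROR_CODES)
        return -1;
    handler_table[code] = h;
    return 0;
}

/* Determination is one range check plus one array index, so its
 * worst-case time is independent of how many handlers are installed. */
error_handler_t determine_handler(int code) {
    if (code < 0 || code >= MAX_ERROR_CODES || handler_table[code] == NULL)
        return default_handler;
    return handler_table[code];
}

/* Raise an error: select the handler in constant time, then run it. */
int raise_error(int code, void *info) {
    error_handler_t h = determine_handler(code);
    assert(h != NULL); /* determine_handler always returns a handler */
    return h(code, info);
}
```

A caller would install handlers during initialization, for example install_handler(3, record_handler), and later call raise_error(3, NULL) when the error is detected. The sketch bounds only the time to select a handler; the running time of the handler itself, and the criticality and external-handling issues listed above, are not addressed by it.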

To address our second goal, we will build an embedded systems testbed consisting of a complex computer-controlled electric train layout that includes sections of both parallel and shared track. Every section of track and the track intersections are independently controlled. Up to 64 optical and mechanical sensors are scattered around the track to detect the locations of multiple trains on the tracks.

Once the hardware and software to control multiple trains are finished, we will experiment with various methods to detect and handle errors. The train environment is ideal for this research, as it is safe for researchers to work with, it has a low cost, and it allows one to easily create a wide variety of errors. Examples of error scenarios that can be exercised include trains derailing or taking too long to arrive at their destinations, impending collisions, broken wires and bad sensors, and continued execution in the presence of software inconsistencies. In comparison, embedded applications such as flight control systems and cruise control in automobiles are far more expensive to experiment with, and injecting errors into these systems can be dangerous, while embedded systems such as satellite communication are very difficult to test for all possible errors.
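
As an illustration of one error class mentioned above, a train taking too long to arrive, the following sketch shows how a controller might compare sensor events against an expected arrival deadline. It is an assumption-laden example, not the testbed's software: the types and functions (train_expectation_t, check_timeout, check_sensor_event), the use of integer sensor identifiers, and the 32-bit tick counter are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; not the testbed's actual code. */
typedef enum { TRAIN_OK, TRAIN_LATE, TRAIN_UNEXPECTED } train_status_t;

typedef struct {
    int      next_sensor;   /* sensor the train is expected to trip next */
    uint32_t deadline_tick; /* latest tick at which it should trip it    */
} train_expectation_t;

/* True if 'now' is strictly after 'deadline'. The signed difference
 * keeps the comparison correct across tick-counter wraparound on the
 * usual two's-complement targets. */
static int tick_after(uint32_t now, uint32_t deadline) {
    return (int32_t)(now - deadline) > 0;
}

/* Periodic check while no sensor event has arrived: a passed deadline
 * suggests a stalled or derailed train, or a failed sensor. */
train_status_t check_timeout(const train_expectation_t *e, uint32_t now_tick) {
    assert(e != NULL);
    return tick_after(now_tick, e->deadline_tick) ? TRAIN_LATE : TRAIN_OK;
}

/* Check on a sensor event: the wrong sensor points to a routing or
 * sensor fault, while the right sensor after the deadline is a timing
 * error. */
train_status_t check_sensor_event(const train_expectation_t *e,
                                  int sensor, uint32_t now_tick) {
    assert(e != NULL);
    if (sensor != e->next_sensor)
        return TRAIN_UNEXPECTED;
    return tick_after(now_tick, e->deadline_tick) ? TRAIN_LATE : TRAIN_OK;
}
```

Detection is deliberately separated from handling here: the functions only classify the situation, leaving the choice of response (stopping the train, rerouting, flagging a sensor as bad) to whatever handling mechanism is under study.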

Experimental Testbeds

To experimentally validate new real-time error detection and handling strategies, we are using the computer-controlled electric train testbed.

Impact

Mechanisms for error detection and handling in real-time systems, along with a better understanding of how to use these mechanisms, can prove invaluable in saving lives and money. They can significantly increase the reliability of embedded systems, even in systems requiring fault tolerance. The embedded system testbed we will build also provides a means for educating engineers in the design of error handling, so that their first experience with errors is not on critical applications.

Relevant Publications

[1] A Distributed and Time-bounded Exception Handling Mechanism for Dimensionally Reconfigurable Real-time Software
[2] Real-Time Software Design and Analysis of Reconfigurable Multi-Sensor Based Systems
[3] Mechanisms for Detecting and Handling Timing Errors



© 1999 University of Maryland, College Park, MD 20742. All Rights Reserved.
For more information on the SERTS Laboratory, contact Dr. D. Stewart at
dstewart@eng.umd.edu