Error Handling Mechanisms for Real-Time Software
|
Faculty: D. Stewart
Students: Tom Carley, Jun Lang
|
Objective
|
Error handling is needed to ensure reliable and predictable operation of
embedded systems in the presence of errors. Errors can result from hardware
failures, software bugs, discrepancies between the external environment and
the internal representation, or from improper timing and synchronization.
Current error handling mechanisms, however, only address the needs of
non-real-time systems, as highlighted in our recent survey. This project
addresses error handling in real-time systems.
|
Motivation
|
Studies show that up to 70 percent of the code in reliable software systems
can be devoted to detecting and handling errors and other exceptional cases. The
focus of developing real-time software, however, has been on the structure
of initialization and normal operation code within processes. The first goal
of this project is to address the shortcomings of existing methods, and create
error handling mechanisms suitable for real-time systems, both in embedded
environments and on symmetric multiprocessor (SMP) clusters.
There has been very little research in error handling for real-time systems,
mainly because the problem is not well understood. Ironically, many of the
most critical embedded systems have very little error handling. When we asked
the engineers why not, they responded, "if an error occurs, we don't even know
what to do with it, so what is the point of even detecting it." Critical
systems often resort to expensive replication of hardware to support fault
tolerance, or they fail in a fatal manner, causing loss of human life and
major loss of equipment. The second goal of this project is to gain an
understanding of errors in embedded systems. We will implement embedded system
applications in which we can intentionally create every error imaginable, yet
still maintain a safe working environment and avoid costly damage
to the equipment.
|
Approach
|
It is difficult to use most error handling mechanisms. The Chimera RTOS
provides global error handling, which gives C code error handling capabilities
similar to the C++ throw and catch mechanism, but at the operating
system level. However, we have found that neither Chimera's global error
handling nor the throw and catch mechanism in C++ is satisfactory for
achieving high reliability in the presence of errors in real-time systems.
In our survey of exception handling mechanisms, we found that current
mechanisms already offer approaches suitable for embedded applications for
exception representation, handler binding, exception raising, handling action,
information passing, handler scope, resource cleanup, exception interface,
and reliability checks. None of the existing approaches to handler
determination, criticality management, propagation of errors, and handler
reuse, however, is sufficient for real-time systems, and only the mechanism
we previously developed for Chimera can detect and handle timing errors.
The following major issues need to be
addressed in order to create a mechanism suitable for real-time software:
- O(1) time-bounded handler determination: In existing mechanisms, the
worst-case time to select the appropriate error handler in response to
detection of an error is a function of the number of handlers installed in the
system. Determination of the handler must be time-bounded if a real-time system
is to remain predictable even in the presence of errors.
- External handling of errors: Programming language mechanisms do not
have the concept of processes, and therefore no concept of transferring errors
to different processes. Current operating system-based mechanisms are aware of
processes, but rely on the user to handle the error internally, then send a
message or signal to a different process. An embedded software framework
should support transparent external handling.
- Criticality management: To prevent priority inversion, handlers for
less critical errors must not preempt execution of high-priority tasks. Only
the timing error detection and handling mechanism in Chimera provides some
priority control, and it does not apply to errors unrelated to timing.
- Handler reuse: Just as it is desirable to reuse and reconfigure
main-body code, which we achieve through port-based objects, there should
exist the ability to reuse and reconfigure error handlers dynamically. The
concept of error handler reconfiguration has not yet been addressed.
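As a rough illustration of the first and third issues above, the following C sketch shows one way handler determination can be made independent of the number of installed handlers: each error code indexes directly into a fixed table. This is not the design described in [1] and not Chimera's API. The error codes, the names `install_handler` and `raise_error`, and the convention that handler criticality and task priority share one numeric scale (larger means more important) are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error codes; a real system would define its own set. */
enum { ERR_SENSOR_FAULT, ERR_DEADLINE_MISSED, ERR_COMM_LOST, ERR_COUNT };

typedef void (*handler_fn)(int err);

typedef struct {
    handler_fn fn;      /* NULL means no handler is installed */
    int criticality;    /* assumption: larger value = more critical */
    int pending;        /* set when a raise was deferred */
} handler_entry;

typedef enum { DISPATCH_RAN, DISPATCH_DEFERRED, DISPATCH_UNHANDLED } dispatch_result;

/* One slot per error code, so lookup never scans a handler list. */
static handler_entry handler_table[ERR_COUNT];

/* Install or replace the handler for one error code; returns 0 on success. */
int install_handler(int err, handler_fn fn, int criticality)
{
    if (err < 0 || err >= ERR_COUNT)
        return -1;
    handler_table[err].fn = fn;
    handler_table[err].criticality = criticality;
    handler_table[err].pending = 0;
    return 0;
}

/*
 * Handler determination is a single array index, so its cost does not
 * depend on how many handlers are installed. A handler less critical than
 * the running task is marked pending instead of being run immediately.
 */
dispatch_result raise_error(int err, int running_task_priority)
{
    if (err < 0 || err >= ERR_COUNT || handler_table[err].fn == NULL)
        return DISPATCH_UNHANDLED;
    if (handler_table[err].criticality < running_task_priority) {
        handler_table[err].pending = 1;
        return DISPATCH_DEFERRED;
    }
    handler_table[err].fn(err);
    return DISPATCH_RAN;
}

/* Illustrative handler: only counts its invocations. */
int stop_train_calls = 0;
void stop_train(int err) { (void)err; stop_train_calls++; }

/* Usage: install one handler, then raise the error at two task priorities. */
void example_usage(void)
{
    int rc = install_handler(ERR_DEADLINE_MISSED, stop_train, 5);
    dispatch_result low = raise_error(ERR_DEADLINE_MISSED, 3);
    dispatch_result high = raise_error(ERR_DEADLINE_MISSED, 9);
    assert(rc == 0);
    assert(low == DISPATCH_RAN);        /* handler outranks the task */
    assert(high == DISPATCH_DEFERRED);  /* task outranks the handler */
}
```

The sketch leaves out what a scheduler would do with a pending entry, and it does not attempt external handling or handler reconfiguration beyond replacing a table slot.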
We have recently addressed the time-bounded handler determination issue; our
design is described in [1]. Part of
our research plan is to address the remaining issues. As a longer-range
plan, we will integrate the mechanism with Chimera's timing error
detection and handling mechanism.
To address our second goal, we will build an embedded systems testbed
consisting of a complex computer-controlled electric train layout that
includes sections of both parallel and shared track. Every section of track
and the track intersections are independently controlled. Up to 64 optical
and mechanical sensors are scattered around the track to detect the locations
of multiple trains on the tracks.
Once the hardware and software to control multiple trains are finished, we will
experiment with various methods to detect and handle errors. The train
environment is ideal for this research, as it is safe for researchers to work
with, it has a low cost, and it allows one to easily create a wide variety of
errors. Examples of error scenarios that can be exercised include trains
derailing or taking too long to arrive at destinations, detecting impending
collisions, diagnosing broken wires and bad sensors, and continued execution
in the presence of software inconsistencies. In comparison, embedded applications
such as flight control systems and automotive cruise control are much
more costly, and injecting errors into these systems can be dangerous, while
embedded systems such as satellite communication are very difficult to test
for all possible errors.
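One of the scenarios above, a train taking too long to arrive, amounts to a missed deadline between two track sensors. The C sketch below shows how such a check might look, assuming a free-running 32-bit millisecond clock and a fixed travel-time budget per track section. The `section_watch` type and the `watch_*` functions are hypothetical names for this example. They are not part of the testbed software or of Chimera's timing error mechanism.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { TRAIN_NOT_EXPECTED, TRAIN_ON_TIME, TRAIN_LATE } arrival_status;

/* Hypothetical per-section record of a train in transit. */
typedef struct {
    uint32_t departed_ms;  /* clock reading when the train entered the section */
    uint32_t budget_ms;    /* allowed travel time to the next sensor */
    int in_transit;        /* nonzero while an arrival is awaited */
} section_watch;

/* Start timing a train that has just entered the section. */
void watch_start(section_watch *w, uint32_t now_ms, uint32_t budget_ms)
{
    assert(w != NULL);
    w->departed_ms = now_ms;
    w->budget_ms = budget_ms;
    w->in_transit = 1;
}

/* Record that the next sensor saw the train; this stops the timing check. */
void watch_arrived(section_watch *w)
{
    assert(w != NULL);
    w->in_transit = 0;
}

/* Meant to be called periodically; reports whether the train is overdue. */
arrival_status watch_check(const section_watch *w, uint32_t now_ms)
{
    assert(w != NULL);
    if (!w->in_transit)
        return TRAIN_NOT_EXPECTED;
    /* Unsigned subtraction stays correct across one wraparound of the clock. */
    uint32_t elapsed = (uint32_t)(now_ms - w->departed_ms);
    return (elapsed > w->budget_ms) ? TRAIN_LATE : TRAIN_ON_TIME;
}
```

A control loop could call `watch_check` on every section each cycle and raise a timing error for any section that reports `TRAIN_LATE`. Whether the cause is a derailment, a stalled train, or a bad sensor would still have to be diagnosed separately.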
|
Experimental Testbeds
|
To experimentally validate new real-time error detection and handling strategies, we are using the computer-controlled electric train testbed.
|
Impact
|
Mechanisms for error detection and handling in real-time systems, along with a
better understanding of how to use these mechanisms, can prove
invaluable in saving lives and money. They can significantly increase the
reliability of embedded systems, even in systems requiring fault tolerance.
The embedded system testbed we will build also provides a means for educating
engineers in the design of error handling, so that their first experience with
errors is not on critical applications.
|
Relevant Publications
|
[1] A Distributed and Time-bounded Exception Handling Mechanism for
Dynamically Reconfigurable Real-time Software
[2] Real-Time Software Design and Analysis of Reconfigurable Multi-Sensor
Based Systems
[3] Mechanisms for Detecting and Handling Timing Errors
|
© 1999 University of Maryland, College Park, MD 20742.
All Rights Reserved.
For more information on the SERTS Laboratory, contact
Dr. D. Stewart at
dstewart@eng.umd.edu