Properties of Real-Time Embedded Systems
Of course, software development is hard. Embedded software development is harder. Real-time embedded software is even harder than that. This is not to minimize the difficulty in reliably developing application software, but there are a host of concerns with real-time and embedded systems that don’t appear in the production of typical applications.
An embedded system is one that contains at least one CPU but does not provide general computing services to the end users. A cell phone is considered an embedded computing platform because it contains one or more CPUs but provides a dedicated set of services (although the distinction is blurred in many contemporary cell phones). Our modern society is filled with embedded computing devices: clothes washers, air traffic control computers, laser printers, televisions, patient ventilators, cardiac pacemakers, missiles, global positioning systems (GPS), and even automobiles—the list is virtually endless.
The issues that appear in real-time embedded systems manifest themselves on four primary fronts. First, the optimization required to run effectively in highly resource-constrained environments makes embedded systems more challenging to create. It is true that embedded systems run the gamut from 8-bit processors in dishwashers and similar machinery up to collaborating sets of 64-bit computers. Nevertheless, most (but not all) embedded systems are constrained in terms of processor speed, memory, and user interface (UI). This means that many of the standard approaches to application development are inadequate on their own and must be optimized to fit into the computing environment and perform their tasks. Thus embedded systems typically require far more optimization than standard desktop applications. I remember writing a real-time operating system (RTOS) for a cardiac pacemaker that had 32kB of static memory for what amounted to an embedded 6502 processor.9 Now that’s an embedded system!
Along with the highly constrained environments, there is usually a need to write more device-driver-level software for embedded systems than for standard application development. This is because these systems are more likely to have custom hardware for which drivers do not exist, but even when they do exist, they often do not meet the platform constraints. This means that not only must the primary functionality be developed, but the low-level device drivers must be written as well.
The real-time nature of many embedded systems means that predictability and schedulability affect the correctness of the application. In addition, many such systems have high reliability and safety requirements. These characteristics require additional analyses: schedulability analysis (e.g., rate monotonic analysis, or RMA), reliability analysis (e.g., failure modes and effects analysis, or FMEA), and safety analysis (e.g., fault tree analysis, or FTA). In addition to “doing the math,” effort must be made to ensure that these additional requirements are met.
Last, a big difference between embedded and traditional applications is the nature of the so-called target environment—that is, the computing platform on which the application will run. Most desktop applications are “hosted” (written) on the same standard desktop computer that serves as the target platform. This means that a rich set of testing and debugging tools is available for verifying and validating the application. In contrast, most embedded systems are “cross-compiled” from a desktop host to an embedded target. The embedded target lacks the visibility and control of the program execution found on the host, and most of the desktop tools are useless for debugging or testing the application on its embedded target. The debugging tools used in embedded systems development are almost always more primitive and less powerful than their desktop counterparts. Not only are the embedded applications more complex (due to the optimization), and not only do they have to drive low-level devices, and not only must they meet additional sets of quality-of-service (QoS) requirements, but the debugging tools are far less capable as well.
It should be noted that another difference exists between embedded and “IT” software development. IT systems are typically long-lived, continuously providing services; software work, for the most part, consists of small incremental efforts to remove defects and add functionality. Embedded systems differ in that they are released at an instant in time and must provide their functionality at that instant. Updating an embedded system is a larger effort, so such systems are often replaced outright rather than being “maintained” in the IT sense. This means that IT software can be maintained in smaller incremental pieces than can embedded software, and “releases” have more significance in embedded software development.
A “real-time system” is one in which timeliness is important to correctness. Many developers incorrectly assume that “real-time” means “real fast.” It clearly does not. Real-time systems are “predictably fast enough” to perform their tasks. If processing your eBay order takes an extra couple of seconds, the server application can still perform its job. Such systems are not usually considered real-time, although they may be optimized to handle thousands of transactions per second, because if the system slows down, it doesn’t affect the system’s correctness. Real-time systems are different. If a cardiac pacemaker fails to induce current through the heart muscle at the right time, the patient’s heart can go into fibrillation. If the missile guidance system fails to make timely corrections to its attitude, it can hit the wrong target. If the GPS satellite doesn’t keep a highly precise measure of time, position calculations based on its signal will simply be wrong.
Real-time systems are categorized in many ways. The most common is the broad grouping into “hard” and “soft.” “Hard” real-time systems exhibit significant failure if even a single action fails to execute within its time frame. The measure of timeliness is called a deadline—the time after action initiation by which the action must be complete. Not all deadlines must be in the microsecond time frame for a system to be real-time. The F2T2EA (Find, Fix, Track, Target, Engage, Assess) Kill Chain is a fundamental aspect of almost all combat systems; the end-to-end deadline for this compound action might be on the order of 10 minutes, but pilots absolutely must achieve these deadlines for combat effectiveness.
The value of the completion of an action as a function of time is an important concept in real-time systems and is expressed as a “utility function” as shown in Figure 1.1. This figure expresses the value of the completion of an action to the user of the system. In reality, utility functions are smooth curves but are most often modeled as discontinuous step functions because this eases their mathematical analysis. In the figure, the value of the completion of an action is high until an instant in time, known as the deadline; at this point, the value of the completion of the action is zero. The length of time from the current time to the deadline is a measure of the urgency of the action. The height of the function is a measure of the criticality or importance of the completion of the action. Criticality and urgency are important orthogonal properties of actions in any real-time system. Some scheduling schemas optimize for urgency, others optimize for importance, and still others support a fairness doctrine (all actions move forward at about the same rate).
Figure 1.1 Utility function
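The step-function model of Figure 1.1 can be sketched in a few lines. This is an illustrative sketch only; the function name, time unit, and values are hypothetical, not from the text.

```python
# Hypothetical sketch of a hard deadline's utility modeled as the
# discontinuous step function of Figure 1.1: full value (the criticality,
# i.e., the height of the step) before the deadline, zero at and after it.
def utility(completion_time: float, deadline: float, criticality: float) -> float:
    """Value to the system's user of completing an action at completion_time."""
    return criticality if completion_time < deadline else 0.0

# Completing before the 10 ms deadline yields the full value of 5.0;
# completing at or past the deadline yields nothing.
print(utility(9.0, 10.0, 5.0))   # 5.0
print(utility(10.0, 10.0, 5.0))  # 0.0
```

A soft real-time action would instead be modeled with a utility that decays gradually after the deadline rather than dropping straight to zero.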
Actions are the primitive building blocks of concurrency units, such as tasks or threads. A concurrency unit is a sequence of actions in which the order is known; the concurrency unit may have branch points, but the sequence of actions within a set of branches is fully deterministic. This is not true for the actions between concurrency units. Between concurrency units, the sequence of actions is neither known nor important, except at explicit synchronization points.
Figure 1.2 illustrates this point. The flow in each of the three tasks (shown on a UML activity diagram) is fully specified. In Task 1, for example, the sequence is that Action A occurs first, followed by Action B and then either Action C or Action D. Similarly, the sequence for the other two tasks is fully defined. What is not defined is the sequence between the tasks. Does Action C occur before or after Action W or Action Gamma? The answer is that you don’t know, and you don’t care. However, we know that before Action F, Action X, and Action Zeta can occur, Action E, Action Z, and Action Gamma must all have occurred. This is what is meant by a task synchronization point.
Figure 1.2 Concurrency units
Because synchronization points, as well as resource sharing, are common in real-time systems, they require special attention not often found in the development of IT systems.
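The synchronization point of Figure 1.2 can be sketched with a barrier. The task and action names follow the figure, but this particular implementation (a `threading.Barrier`) is my assumption, not something prescribed by the text.

```python
# Hypothetical sketch: three tasks each perform a fully ordered sequence of
# actions, but Actions F, X, and Zeta cannot begin until Actions E, Z, and
# Gamma have all completed -- the task synchronization point of Figure 1.2.
import threading

log = []                          # shared record of completed actions
log_lock = threading.Lock()
barrier = threading.Barrier(3)    # all three tasks must arrive here

def task(pre_action: str, post_action: str) -> None:
    with log_lock:
        log.append(pre_action)    # e.g., Action E
    barrier.wait()                # the synchronization point
    with log_lock:
        log.append(post_action)   # e.g., Action F

threads = [threading.Thread(target=task, args=pair)
           for pair in [("E", "F"), ("Z", "X"), ("Gamma", "Zeta")]]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The relative order within each group is unknown ("you don't care"), but
# every pre-barrier action precedes every post-barrier action.
assert set(log[:3]) == {"E", "Z", "Gamma"}
assert set(log[3:]) == {"F", "X", "Zeta"}
```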
Within a task, several different properties are important and must be modeled and understood for the task to operate correctly (see Figure 1.3). Tasks that are time-based occur with a certain frequency, called the period. The period is the time between invocations of the task. The variation around the period is called jitter. For event-based task initiation, the time between task invocations is called the interarrival time. For most schedulability analyses, the shortest such time, called the minimum interarrival time, is used for analysis. The time from the initiation of the task to the point at which its set of actions must be complete is known as the deadline.

When tasks share resources, it is possible that a needed resource isn’t available. When a necessary resource is locked by a lower-priority task, the current task must block and allow the lower-priority task to complete its use of the resource before the original task can run. The length of time the higher-priority task is prevented from running is known as the blocking time. The fact that a lower-priority task must run even though a higher-priority task is ready to run is known as priority inversion and is a property of all priority-scheduled systems that share resources among task threads. Priority inversion is unavoidable when tasks share resources, but when uncontrolled, it can lead to missed deadlines. One of the things real-time systems must do is bound priority inversion (e.g., limit blocking to the depth of a single task) to ensure system timeliness.

The period of time that a task requires to perform its actions, including any potential blocking time, is called the task execution time. For analysis, it is common to use the longest such time period, the worst-case execution time, to ensure that the system can always meet its deadlines. Finally, the time between the end of the execution and the deadline is known as the slack time.
In real-time systems, it is important to capture, characterize, and manage all these task properties.
Figure 1.3 Task time
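The timing properties above can be collected into a single record per task. A minimal sketch, assuming my own field names and hypothetical values; note that, following the text, the worst-case execution time here already includes any blocking time, so slack is simply the deadline minus that figure.

```python
# Hypothetical sketch of the task timing properties of Figure 1.3.
from dataclasses import dataclass

@dataclass
class TaskTiming:
    period_ms: float    # time between invocations (time-based task)
    deadline_ms: float  # time after initiation by which actions must complete
    wcet_ms: float      # worst-case execution time, including blocking time

    @property
    def slack_ms(self) -> float:
        # Slack: the time between the end of the execution and the deadline.
        return self.deadline_ms - self.wcet_ms

# A hypothetical 1 Hz pacing task that must finish within its period.
pacer = TaskTiming(period_ms=1000.0, deadline_ms=1000.0, wcet_ms=250.0)
print(pacer.slack_ms)  # 750.0
```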
Real-time systems are most often embedded systems as well and carry those burdens of development. In addition, real-time systems have timeliness and schedulability constraints. Real-time systems must be timely—that is, they must meet their task completion time constraints. The entire set of tasks is said to be schedulable if all the tasks are timely. Real-time systems are not necessarily (or even usually) deterministic, but they must be predictably bounded in time. Methods exist to mathematically analyze systems for schedulability,10 and there are tools11 to support that analysis.
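One widely used schedulability test, given here as an illustrative sketch with a hypothetical task set, is the rate monotonic utilization bound: n periodic tasks with deadlines equal to their periods and rate-monotonic (shorter period = higher) priorities are schedulable if their total CPU utilization does not exceed n(2^(1/n) - 1).

```python
# Rate monotonic utilization-bound test. Exceeding the bound does NOT
# prove the task set unschedulable -- the test is sufficient, not
# necessary -- but passing it guarantees all deadlines are met.
def rm_schedulable(tasks) -> bool:
    """tasks: list of (worst_case_execution_time, period) pairs, same unit."""
    n = len(tasks)
    utilization = sum(c / t for c, t in tasks)
    bound = n * (2 ** (1.0 / n) - 1)   # ~0.78 for n = 3, ~0.69 as n grows
    return utilization <= bound

# Hypothetical set: utilizations 0.1 + 0.1 + 0.4 = 0.6, under the 3-task bound.
print(rm_schedulable([(1, 10), (2, 20), (10, 25)]))  # True
# Utilization 0.88 exceeds the bound, so this test cannot guarantee it.
print(rm_schedulable([(5, 10), (5, 15), (1, 20)]))   # False
```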
Safety-critical and high-reliability systems are special cases of real-time and embedded systems. The term safety means “freedom from accidents or losses”12 and is usually concerned with safety in the absence of faults as well as in the presence of single-point faults. Reliability is usually a stochastic measure of the percentage of the time the system delivers services.
Safety-critical systems are real-time systems because safety analysis includes the property of fault tolerance time—the length of time a fault can be tolerated before it leads to an accident. They are almost always embedded systems as well and provide critical services such as life support, flight management for aircraft, medical monitoring, and so on. Safety and reliability are assured through the use of additional analyses, such as FTA, FMEA, and failure mode, effects, and criticality analysis (FMECA); these often result in a document called the hazard analysis, which combines fault likelihood, fault severity, risk (the product of the previous two), hazardous conditions, fault protection means, fault tolerance time, fault detection time, and fault protection action time. Safety-critical and high-reliability systems require additional analysis and documentation to achieve approval from regulatory agencies such as the FAA and FDA.
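The risk column of such a hazard analysis is, as noted above, simply the product of fault likelihood and fault severity. A minimal sketch; the hazard names, scales, and values below are hypothetical and not drawn from any real hazard analysis.

```python
# Hypothetical hazard analysis fragment: risk = likelihood * severity.
hazards = [
    # (hazard, estimated likelihood per operating hour, severity score)
    ("pacing pulse omitted", 1e-6, 10),
    ("telemetry dropout",    1e-3, 2),
]

risks = {name: likelihood * severity
         for name, likelihood, severity in hazards}
for name, risk in risks.items():
    print(f"{name}: risk = {risk:g}")
```

In a real hazard analysis each row would also carry the hazardous condition, the protection means, and the fault tolerance, detection, and protection action times listed above.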
It is not at all uncommon for companies and projects to specify very heavyweight processes for the development of these kinds of systems—safety-critical, high-reliability, real-time, or embedded—as a way of injecting quality into those systems. And it works, to a degree. However, it works at a very high cost. Agile methods provide an alternative perspective on the development of these kinds of systems that is lighter-weight but does not sacrifice quality.