1.3 Life-Cycle Processes

We start with an overview of the processes involved in engineering an AI system, which are structured in a life cycle. This initial discussion is followed by more details in the remainder of this chapter, and yet more details are provided in the other chapters of the book.

A system has a life cycle. The individual components are developed and tested in isolation. They are then integrated to form an executable version of the system. This executable is tested; if it passes the tests, it is deployed to production. The deployed system is then operated. If problems occur during operation, they must be fixed, either directly in the running system or by releasing updates to the system.

Very different development techniques are used to create the AI portion and the non-AI portion of the system. Developing the AI portion involves a variety of specialized techniques for model preparation, model training, and model testing. The two portions are brought together to build an executable artifact. Once an executable is built, the system progresses through a pipeline that relies heavily on tools to test the executable and deploy the result. Chapters 2 and 6 describe this process in more detail.

Figure 1.2 depicts the life cycle, with the processes being roughly arranged in two overlapping circles. The main circle lists the development processes for the overall system; the AI model processes are shown in a smaller, half-filled circle connected to the system development and build processes. The main circle is entered via the Design arrow, at the bottom right of the figure.

Figure 1.2 Life-cycle processes for engineering an AI system.

Figure 1.2 depicts the various processes involved in the life cycle of engineering AI-based systems. There is an initial design phase, where decisions are made about which functionalities will be accomplished by the AI portion and which by the non-AI portion. This phase is also where the software architecture of the system is designed, with the designers aiming to meet the requirements and goals for the system being constructed. The software architecture design will embody the resource choices and, in turn, the resource requirements for the total system. We discuss software architectures in Section 2.1.2, “Distributed Software Architectures.”

The model development stage focuses on the selection, exploration, training, and tuning of the AI models. It includes tasks such as model selection, hyperparameter tuning, training or fine-tuning, and testing the models to achieve optimal performance. The goal is to create a well-performing and accurate model. The model can be developed in parallel with the system development stage or before it. Close collaboration between the teams involved in the non-AI development portion and the AI model development can help to avoid problems and lead to a smoother integration. We discuss this stage further in Section 6.1, “Design.”
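Hyperparameter tuning, one of the tasks named above, can be as simple as an exhaustive grid search over candidate settings. The sketch below uses a toy scoring function in place of real training; the parameter names and the score formula are purely illustrative:

```python
from itertools import product

# Hypothetical training stub: in a real project this would train a model
# and return a validation metric (e.g., accuracy on a held-out set).
def train_and_evaluate(learning_rate, batch_size):
    # Toy score that peaks at lr=0.01, batch_size=32 (illustrative only).
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(batch_size - 32) / 1000

def grid_search(param_grid):
    """Try every combination in the grid; return the best one and its score."""
    best_params, best_score = None, float("-inf")
    for lr, bs in product(param_grid["learning_rate"], param_grid["batch_size"]):
        score = train_and_evaluate(lr, bs)
        if score > best_score:
            best_params = {"learning_rate": lr, "batch_size": bs}
            best_score = score
    return best_params, best_score

grid = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [16, 32, 64]}
params, score = grid_search(grid)
```

In practice the grid grows combinatorially, which is why smarter search strategies (random search, Bayesian optimization) are commonly substituted for the exhaustive loop shown here.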

The Dev stage involves performing normal development activities and creating scripts for other activities related to the AI system. It includes tasks such as developing additional functionalities (the non-AI parts), optimizing performance, and ensuring compatibility with other system components.

The system build stage focuses on creating executable artifacts for the entire system. This involves transforming the system or its parts into a deployable format that can be executed within the production environment. Both the AI portion and the non-AI portion are integrated and included in the executable artifact that is the output of the build stage. We discuss this stage in Section 6.3, “Build.”
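The essence of the build stage, packaging the AI and non-AI portions into one traceable artifact, can be sketched as assembling both into a single directory with a manifest of content hashes. The file names and manifest layout below are hypothetical, not a standard format:

```python
import hashlib
import json
import pathlib
import tempfile

def build_artifact(model_bytes, app_files, out_dir):
    """Assemble model weights and application files into one deployable
    directory, with a manifest recording content hashes for traceability."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # The AI portion: serialized model weights.
    (out / "model.bin").write_bytes(model_bytes)
    manifest = {"model.bin": hashlib.sha256(model_bytes).hexdigest()}
    # The non-AI portion: application source files.
    for name, text in app_files.items():
        (out / name).write_text(text)
        manifest[name] = hashlib.sha256(text.encode()).hexdigest()
    (out / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest

with tempfile.TemporaryDirectory() as tmp:
    manifest = build_artifact(b"\x00weights", {"app.py": "print('serving')"}, tmp)
```

Real build stages typically produce container images rather than bare directories, but the principle is the same: the output is a single versioned artifact containing both portions of the system.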

1.3.1 Testing a System

After the system is built, it needs to be thoroughly tested, using automated tests as much as possible. The system test stage involves evaluating the system’s performance, functionality, and reliability through various testing techniques. The following types of testing are often performed:

  • Regression testing

  • Smoke testing

  • Compatibility testing

  • Integration testing

  • Functional testing

  • Usability testing

  • Install/uninstall testing

  • Quality testing

Testing is covered in more detail in Chapter 6, while Chapters 7–11 cover specific qualities and how to test for them.
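To make one of these types concrete: a smoke test is a fast sanity check run after every build, before the deeper suites. The sketch below checks that a stand-in prediction function returns a well-formed response; the `predict` function and its response shape are hypothetical:

```python
# Hypothetical predict function standing in for the deployed AI component.
def predict(features):
    return {"label": "positive" if sum(features) > 0 else "negative",
            "confidence": 0.9}

def smoke_test():
    """Fast sanity checks: does the component respond at all, and is the
    response shaped as expected? Deeper suites run only if this passes."""
    result = predict([1.0, 2.0])
    assert set(result) == {"label", "confidence"}, "unexpected response shape"
    assert result["label"] in {"positive", "negative"}, "label out of range"
    assert 0.0 <= result["confidence"] <= 1.0, "confidence not a probability"
    return True

passed = smoke_test()
```

In a real project such checks would live in a test framework such as pytest and run automatically in the build pipeline.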

Once the system has been tested and approved, it can be released for deployment. The system release stage involves finalizing the system for deployment in the production environment and ensuring its readiness for operation. This involves final quality gates, which may be automated or include manual activities.

1.3.2 Deploying a System

The next stage of the life cycle is deployment, which involves moving the system into the production environment. This includes setting up the necessary infrastructure, applying the configuration of the system, and ensuring a smooth transition from the old version of the production system to the new version.

Both the AI and non-AI portions of the system will be updated over time. One strategy for installing updates is to shut down the system for a period. An alternative strategy is to perform live updating—that is, to install the changes without shutting down the system. Either option requires architectural support. The choice is a business decision, not a technical one. Once the business decision has been made, the architecture must be designed to support it. We discuss these techniques further in Chapter 6.
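One common way to provide the architectural support for live updating is a blue–green arrangement: two environments run side by side, and a router atomically switches traffic to the newly deployed version. The class and environment stubs below are a minimal illustrative sketch, not a production router:

```python
class BlueGreenRouter:
    """Minimal sketch of live updating: two environments exist side by side,
    and the router switches which one receives new requests."""

    def __init__(self, blue, green):
        self.envs = {"blue": blue, "green": green}
        self.active = "blue"

    def handle(self, request):
        return self.envs[self.active](request)

    def switch(self):
        # The cut-over is a single assignment: no downtime, and the old
        # environment stays available for instant rollback.
        self.active = "green" if self.active == "blue" else "blue"

# Hypothetical environments: v1 currently live, v2 newly deployed.
router = BlueGreenRouter(blue=lambda r: "v1:" + r, green=lambda r: "v2:" + r)
before = router.handle("ping")   # served by v1
router.switch()                  # deployment cut-over
after = router.handle("ping")    # served by v2
```

The cost of this strategy is running two copies of the system during the transition, which is part of why the shutdown-versus-live-update choice is ultimately a business decision.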

1.3.3 Operating and Monitoring a System

Once the system is deployed, it enters the operation and monitoring stage. During this stage, the system executes, and measurements about its operation are gathered through monitoring. Monitoring serves multiple purposes:

  • Determine whether the overall system is meeting its performance and availability goals.

  • Determine whether the AI portion is meeting its AI-specific quality goals.

  • Ensure sufficient resources are available for all parts of the system, and identify excess resources that can be shut down.

  • Determine areas for improvement through redesign, code improvement, retraining, or other means.

The data provided to the monitoring system can be event, log, or metric data from various components, or it could be input or output of an AI model. In any case, the architecture should be designed to include such a monitoring component, a process for creating and modifying the rules used to generate alerts, and means to deliver the alerts to specified locations.
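The rules used to generate alerts are, at their core, predicates over collected metrics. A minimal sketch of such a rule engine, with hypothetical metric names and thresholds, might look like this:

```python
def evaluate_rules(metrics, rules):
    """Apply threshold rules to collected metrics; return the alerts fired."""
    alerts = []
    for rule in rules:
        value = metrics.get(rule["metric"])
        if value is not None and rule["predicate"](value):
            alerts.append(f"{rule['name']}: {rule['metric']}={value}")
    return alerts

# Hypothetical rules: names, metrics, and thresholds are illustrative only.
rules = [
    {"name": "LowDiskSpace", "metric": "disk_free_gb",
     "predicate": lambda v: v < 5},
    {"name": "HighLatency", "metric": "p99_latency_ms",
     "predicate": lambda v: v > 500},
]
alerts = evaluate_rules({"disk_free_gb": 2.5, "p99_latency_ms": 120}, rules)
```

A real monitoring system would additionally route each fired alert to its configured destination (pager, dashboard, ticket queue), which is the delivery mechanism the architecture must also provide for.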

The monitoring may be achieved in two different ways:

  • Instrumentation and an external system that periodically collects data from the various components of the system. The external component will generate alerts based on a set of rules (e.g., if available disk space is critically low).

  • A dedicated portion of the system under construction.

One use of the monitoring system is to evaluate AI model performance in production, and subsequently to trigger retraining, fallback, or other adaptive mechanisms as necessary. AI models used from cloud providers through APIs may be subject to varying performance or bandwidth limitations that might cause the system’s performance to deteriorate. Such issues can be detected through monitoring, though fewer options are available to address them at this point than when the AI model was initially trained.
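Detecting deteriorating model performance in production often reduces to comparing the distribution of model outputs against a baseline. The sketch below uses total variation distance between label distributions as a simple drift proxy; the labels and the alert threshold are hypothetical:

```python
from collections import Counter

def label_distribution(labels):
    """Relative frequency of each predicted label."""
    total = len(labels)
    return {label: count / total for label, count in Counter(labels).items()}

def drift_score(baseline, current):
    """Total variation distance between two label distributions
    (0 = identical, 1 = disjoint); a simple proxy for output drift."""
    keys = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(k, 0.0) - current.get(k, 0.0))
                     for k in keys)

# Hypothetical data: predictions at deployment time vs. predictions now.
baseline = label_distribution(["cat"] * 50 + ["dog"] * 50)
current = label_distribution(["cat"] * 80 + ["dog"] * 20)

score = drift_score(baseline, current)
needs_retraining = score > 0.2  # hypothetical alert threshold
```

When ground-truth labels eventually become available, accuracy-based checks can replace this distributional proxy; until then, output drift is one of the few signals available in production.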

Based on insights from DevOps, we want to emphasize that monitoring mechanisms must be designed into the architecture and then implemented and tested during the building of the system. They do not happen automatically. Integrating them into the design of the system is easiest when the specific monitoring requirements are known early on.

1.3.4 Analyzing a System

The final stage is to analyze the system’s performance. This involves displaying measurements taken during operation and monitoring, and analyzing the data to gain insights into the system’s behavior and performance. This analysis can help identify areas for improvement and guide future development efforts.

The life cycle depicted in Figure 1.2 gives people involved in the development, deployment, and operation of AI-based systems a bird’s-eye view of the comprehensive set of processes for engineering such systems. These individuals may be architects, developers, operators, or anyone in a blend of these roles.

Now we discuss the design stage in more detail.
