Graphics Programming in the Modern Age
One advance in technology that contributed to the demise of QuickDraw was the trend in modern personal computers to move more of the system’s graphics capabilities onto video cards. Indeed, the advent of video cards with dedicated graphics processing units has ushered in a new age of computer graphics. Quartz 2D and Core Image take advantage of these recent developments to improve their functionality and performance. Core Image in particular is a direct beneficiary of the power of modern graphics hardware. The evolution of the graphics system in personal computers is extending the reach of those machines to new and exciting fields of endeavor. The Mac OS X graphics system is at the forefront of this technological wave. By using modern graphics APIs like Quartz 2D, your application can take advantage of the work Apple has done, and you can enjoy the benefits of the hardware while concentrating on a simple interface.
By way of an example, consider the impact that modern personal computers have had on video production. We live in an age where studios use computer graphics to create full-length, animated feature films. In the past, digital video production houses used expensive, dedicated workstations to produce their films.
Just as PostScript shifted print publishing from proprietary systems to the desktop, the development of software such as Final Cut Pro has brought professional-quality video editing to consumer computers. This transition works because of a combination of hardware improvements and the advancement of the graphics systems on personal computers.
In the recent past, images that used 32 bits per pixel and alpha channels could only be manipulated by high-end applications. Today, however, these images are commonplace. The CCD chips in modern digital cameras can capture images using 12 bits per color channel or more. Storing these images in an 8-bit-per-channel format discards valuable color information. The high-end applications of today may choose to use a full 32-bit floating point value to represent just one color channel. With four channels, each pixel therefore requires 128 bits. Compared with a 32-bit pixel, an image with the same dimensions requires four times the storage just to hold the additional color information! Processing such an image requires the computer to sift through four times the data.
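The storage arithmetic above can be checked with a short sketch. The function name and the 3000 x 2000 image size are illustrative assumptions, not values from the text, and the calculation ignores compression and file format overhead.

```python
def image_bytes(width, height, channels, bits_per_channel):
    """Return the storage needed for an uncompressed image, in bytes."""
    bits_per_pixel = channels * bits_per_channel
    return width * height * bits_per_pixel // 8

# Hypothetical 3000 x 2000 image with four channels (for example ARGB).
eight_bit = image_bytes(3000, 2000, channels=4, bits_per_channel=8)
float_32 = image_bytes(3000, 2000, channels=4, bits_per_channel=32)

print(eight_bit)              # 24000000 bytes at 32 bits per pixel
print(float_32)               # 96000000 bytes at 128 bits per pixel
print(float_32 // eight_bit)  # 4
```

The ratio depends only on the bits per channel, so it stays at four for any image dimensions.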
Shuffling around large volumes of pixel data is only one difficulty. Because the color channels in these images are stored as floating point values, performing calculations on those pixels requires floating point math. Computing at this level requires significant processing horsepower and efficient use of graphics resources. Computer scientists have answered the demand for greater graphics processing power by adding dedicated computer graphics hardware to personal computers.
General Purpose Vector Processors for Graphics
A good example of the evolution of hardware with a corresponding impact on graphics is the addition of vector processing units to general purpose microprocessors. On the Macintosh platform, for example, the G4 and G5 PowerPC processors have a vector processor known as the Velocity Engine. Developers will recognize it by its geeky name, AltiVec. Intel processors include comparable SIMD (single instruction, multiple data) technologies such as MMX and SSE.
AltiVec will serve as a good example of a general purpose vector processor. The registers of the AltiVec unit store quantities that are 128 bits wide. The processor’s instructions treat those bits as vectors. Depending on the instruction, the processor will interpret the 128 bits as a vector whose components have different lengths. Figure 2.1 shows the different ways that AltiVec processors can interpret 128 bits.
Figure 2.1 AltiVec Register Configurations
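As a rough illustration of the idea behind Figure 2.1, the sketch below reinterprets the same 16 bytes in several ways using Python’s standard `struct` module. It only mimics the concept in software: the variable names are hypothetical, and the big-endian byte order is an assumption chosen to match the PowerPC.

```python
import struct

# 16 bytes stand in for the contents of one 128-bit vector register.
register = bytes(range(16))

as_16_bytes = struct.unpack(">16B", register)  # sixteen 8-bit values
as_8_halves = struct.unpack(">8H", register)   # eight 16-bit values
as_4_words = struct.unpack(">4I", register)    # four 32-bit integers
as_4_floats = struct.unpack(">4f", register)   # four 32-bit floats

# The bits never change; only the interpretation does.
print(len(as_16_bytes), len(as_8_halves), len(as_4_words), len(as_4_floats))
print(hex(as_8_halves[0]), hex(as_4_words[0]))
```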
When interpreting a vector of four 32-bit values, the processor can treat each value as either a 32-bit integer or a 32-bit floating point number. A graphics application might feed the AltiVec unit with 16 pixels of an 8-bit grayscale image all at once. The program could then lighten all 16 pixels at once using a single AltiVec instruction.
Using different instructions, a program could also load an AltiVec register with the four 32-bit floating point numbers that make up a single floating point ARGB pixel. The AltiVec processor could combine two floating point ARGB pixels in a single operation. From these two examples, it’s easy to see how the processing muscle of a vector unit like AltiVec can improve graphics performance.
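The two examples above can be emulated in plain Python to show what each vector operation computes. This is a sketch only: the function names are hypothetical, the loops stand in for work that the hardware would perform on all lanes at once, and the blend shown is just one plausible way to combine two pixels. On AltiVec, the first operation would correspond to a saturating add such as the `vec_adds` intrinsic.

```python
def lighten_16(pixels, amount):
    """Add `amount` (0-255) to 16 8-bit gray values, clamping at 255.

    Emulates one saturating vector add across 16 lanes.
    """
    if len(pixels) != 16:
        raise ValueError("expected exactly 16 pixels (one 128-bit vector)")
    return bytes(min(p + amount, 255) for p in pixels)

def mix_argb(a, b, t):
    """Blend two ARGB pixels, each four floats, lane by lane."""
    return tuple(x + (y - x) * t for x, y in zip(a, b))

row = bytes([0, 10, 20, 30, 100, 120, 140, 160,
             200, 220, 240, 250, 251, 253, 254, 255])
lighter = lighten_16(row, 32)
print(list(lighter))  # bright values clamp at 255 instead of wrapping

opaque_red = (1.0, 1.0, 0.0, 0.0)
opaque_blue = (1.0, 0.0, 0.0, 1.0)
print(mix_argb(opaque_red, opaque_blue, 0.5))  # (1.0, 0.5, 0.0, 0.5)
```

The clamping matters for image work: an ordinary 8-bit add would wrap 240 + 32 around to 16 and turn a bright pixel dark.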
One issue that complicates the use of vector processing units for graphics is the fact that the AltiVec unit is not dedicated to graphics alone. Computer games are popular clients of the graphics system. Many games contain computing engines that handle physics calculations. Physics involves working with vector-valued quantities like velocities and accelerations. These calculations are also a good fit for implementation with the vector processor. Scientific visualization applications also rely heavily on the graphics system and include algorithms that benefit from the vector processor. In many applications, the vector processor is shared between the graphics system and other computation engines.
Computers with two or more microprocessors often have the added luxury of a second vector processing unit, and applications can employ that to alleviate some of the congestion. Unfortunately, there is a practical limit that prevents computers from scaling performance through the addition of processing units. General purpose vector processors are complex, so they take up quite a lot of space on silicon chips. Correspondingly, the amount of power they require and the amount of heat they generate increase with each additional unit. Issues like these make it very hard to scale the performance of a general vector processor by simply adding cores. As will be shown, however, if we reduce the complexity of the core, limit the operations it can perform, and focus it on a specific task, the idea of scaling vector processing power this way actually works quite well.
The Emergence of the GPU
Another common technique to boost the graphics performance of a computer is to augment the CPU with an additional processor that is dedicated to graphics. On the personal computer systems of today, that additional processor is usually found on the computer’s video card.
Many of the earliest models of graphics coprocessors, particularly in the personal computer space, were simply tools to speed up some very specific parts of the 3D graphics pipeline. The cards had algorithms for applying lighting and shading models to simple geometric primitives like triangles. The algorithms were hard-wired into the video card and could not be changed. Communication with these cards flowed in one direction only, from the main computer to the video card. The cards were useful for rendering 3D graphics efficiently but could not be used in more general graphics applications.
As time progressed, the services provided by the video card’s processor expanded to include more general purpose routines. The data path between the main CPU and the graphics card widened and became bidirectional. With those innovations, programs gained the ability to use the graphics processor to perform calculations and retrieve the results to main memory. This allowed the video card to behave as a graphics computation engine, not just a display mechanism.
Collectively, these more powerful processors have come to be known as Graphics Processing Units or GPUs. They play a significant role in boosting the graphics capabilities of modern computers. Along with the GPU, a typical video card will also contain a block of dedicated memory (called Video RAM or VRAM) and some kind of hardware that converts bits in the card’s display buffer into video signals for a computer display. In many respects, the video card resembles a self-contained graphics computer. Like other computers, graphics hardware continues to advance. To give you some idea of how rapidly video cards have evolved, consider the graph shown in Figure 2.2.
Figure 2.2 GPU fill rate by year
Figure 2.2 shows how the fill rate of graphics processors has grown over the years. A graphics card’s fill rate is roughly the number of pixels that the GPU can draw into video RAM in a single second. While the true processing power of the GPU varies for different tasks, this graph is a dramatic example of the advances that have been made in the graphics processing power of GPUs.
The growth of graphics processing power actually exceeds the predictions of industry professionals. In 1965, Gordon Moore made a famous observation that the number of transistors on a single chip was doubling roughly every year, a pace he later revised to every two years; the figure popularly quoted today is a doubling approximately every 18 months. This prediction has come to be known as Moore’s Law. The computing industry has taken Moore’s Law to imply that the processing capabilities of silicon chips would grow at the same rate.
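To get a feel for what a fixed doubling period implies, the following sketch computes the compound growth factor over a span of years. The function name is hypothetical, and the 18-month default is simply the commonly quoted figure, not a measured value.

```python
def growth_factor(years, doubling_period_years=1.5):
    """Return how many times larger a quantity becomes after `years`
    if it doubles once every `doubling_period_years`."""
    return 2 ** (years / doubling_period_years)

print(growth_factor(3))  # 4.0: two doublings in three years
print(growth_factor(6))  # 16.0: four doublings in six years
```

A curve that grows faster than Moore’s Law is one whose effective doubling period is shorter than the period assumed here.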
For many years the transistor counts and performance of CPUs have tracked this prediction with frightening accuracy. The graph in Figure 2.2 shows that the performance of GPUs has been growing faster than Moore’s Law would predict. In general terms, this means that the processing power of dedicated graphics processors is growing faster than that of general purpose CPUs like the PowerPC or the Intel x86 family. Applications that tap into that processing power enjoy dramatic performance improvements.
One of the reasons that graphics processors follow this performance curve is that the performance of the processor is easier to scale by throwing more silicon at the problem. Many of the algorithms that the GPU runs are what computer scientists call "embarrassingly parallel." An embarrassingly parallel problem is one that a computer can easily solve by breaking it into smaller, independent pieces and computing each piece along a parallel path.
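A minimal sketch of an embarrassingly parallel image operation follows. Inverting a pixel needs no information from any other pixel, so the image can be cut into bands that are processed independently and then reassembled. The function names and band count are illustrative assumptions, and the thread pool only demonstrates the structure of the decomposition; it is not meant as a performance claim for Python threads.

```python
from concurrent.futures import ThreadPoolExecutor

def invert_rows(rows):
    """Invert a band of 8-bit gray rows; needs no data from other bands."""
    return [[255 - p for p in row] for row in rows]

def invert_image(image, bands=4):
    """Split the image into bands of rows and invert each independently."""
    size = max(1, -(-len(image) // bands))  # ceiling division
    chunks = [image[i:i + size] for i in range(0, len(image), size)]
    with ThreadPoolExecutor(max_workers=bands) as pool:
        # map() returns results in the same order as the input chunks.
        results = list(pool.map(invert_rows, chunks))
    return [row for chunk in results for row in chunk]

# Small synthetic 4 x 8 image: the parallel result matches the serial one.
image = [[x + 10 * y for x in range(4)] for y in range(8)]
print(invert_image(image) == invert_rows(image))  # True
```

Because no band depends on another, adding more workers requires no change to the algorithm itself, which is the property that lets GPU designers scale performance by adding computation paths.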
Astute readers will recognize how a program might apply an SIMD vector processor, like the aforementioned Velocity Engine, to calculate a solution to an embarrassingly parallel problem. But graphics processors don’t need to solve the same problems that general purpose vector units must solve. Because it can focus on solving graphics problems, the GPU requires fewer operations. For example, general purpose processors must deal with branches, loops, and error checking. In contrast, the GPU pushes its vectors through sequentially without branches and loops. Each of the parallel units in a GPU is much simpler than its counterpart in a more general vector processor. Hardware engineers can add more vector units, and therefore more parallel computation paths, in the same area. More computation paths mean more operations completed each cycle.
Because the vector units in the GPUs are dedicated to graphics, they don’t suffer the resource contention issues that plague general purpose vector processors. There aren’t as many parts of applications competing for processor time.
The Programmable Graphics Card
Computers have spent many years sending data to graphics cards, but the ability to send programs to the GPU is a relatively recent innovation. Graphics cards that accept GPU programs from the main computer are known as programmable graphics cards. This ability to program the graphics card is the feature that lends power to graphics systems like Quartz 2D and, in particular, Core Image.
At its heart, Core Image is a system for feeding GPU programs to programmable video cards. The programs it submits usually apply special effects to images. By using the power of the parallel processing paths on the hardware, the computer can calculate those effects much faster than the main CPU could. Another interesting aspect of Core Image is that it can run its effects even if no programmable graphics card is available on the system. This demonstrates another advantage of the Mac OS X graphics architecture. It helps your applications achieve improved performance without undue complexity.
Managing Hardware Complexity
The challenge to today’s applications is finding a way to conveniently take advantage of the power afforded by modern hardware. For example, application programmers who want to use the Velocity Engine must learn the AltiVec instruction set. They must also develop "vectorized" algorithms for solving the application’s problems. The applications often must rearrange data structures so that the vector processor can access them efficiently. All of these issues require specialized knowledge and add complexity to the resulting application.
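One common kind of data rearrangement is converting interleaved pixels into separate planes, so that a vector unit can load many values of the same channel from adjacent memory. The sketch below shows the idea in plain Python; the names and sample values are hypothetical, and real vectorized code would also have to deal with alignment and padding.

```python
# Interleaved layout: each pixel keeps its four ARGB channels together.
pixels = [(1.0, 0.2, 0.4, 0.6), (1.0, 0.1, 0.3, 0.5), (0.5, 0.9, 0.8, 0.7)]

def to_planar(pixels):
    """Rearrange a non-empty list of ARGB pixels into one list per channel."""
    alpha, red, green, blue = (list(channel) for channel in zip(*pixels))
    return {"a": alpha, "r": red, "g": green, "b": blue}

planes = to_planar(pixels)

# With a planar layout, an operation on one channel walks a single
# contiguous list, which maps naturally onto vector loads.
brighter_red = [min(r * 1.5, 1.0) for r in planes["r"]]
print(planes["r"])
```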
In a similar way, making direct use of the GPU requires an application to understand some of the intimate details of the video card. Some cards accept longer graphics programs than others, and some cards have special instructions that simplify GPU code. Writing code that is general enough to support the diverse range of graphics hardware from different vendors is quite difficult. If you’re writing your application directly to the hardware, you will have to immerse yourself in the minutiae of every hardware combination, or you must limit your application to only that small set of hardware you are willing to support.
This same problem applies even in computers that don’t have video cards. Even tasks that appear simple, like copying pixel buffers in main memory efficiently, require a detailed knowledge of the processors’ cache behavior and the system’s virtual memory architecture. While these are all interesting topics, writing graphics programs this "close to the metal" can be complicated and error prone. Providing a rich, full-featured graphics system that allows applications to plug into the performance of the computer, while allowing programmers to focus on creating graphics and not hardware issues, is one of the toughest challenges the operating system vendor must face.
Quartz 2D and Core Image are both excellent technologies in this regard. They insulate applications from the complexity of the hardware but take advantage of that hardware in their own implementations. For example, Apple includes a compatibility mechanism inside Core Image that allows the system to run GPU programs on any system. If a program uses Core Image on a computer without a programmable GPU, the program will continue to run correctly. The effects of applying a filter will take longer to achieve, but the results should be the same even without the dedicated hardware. Application programmers can limit their attention to working with the interface of Core Image instead of the details of the GPU. Similar arguments can be made with respect to the features of Quartz 2D. The graphics architecture of the system insulates applications from hardware details and is easier to use.
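The general pattern of hiding the hardware choice behind a single interface can be sketched as follows. This is not Core Image’s API or its implementation: the `Filter` class, the function names, and the stand-in "GPU" path are all hypothetical, chosen only to show how a caller can stay unaware of which path runs.

```python
def invert_on_cpu(pixels):
    """Reference implementation that runs on the main processor."""
    return [255 - p for p in pixels]

def invert_on_fake_gpu(pixels):
    """Stand-in for an accelerated path; here it just reuses the CPU code."""
    return invert_on_cpu(pixels)

class Filter:
    """Hypothetical filter that hides the choice of hardware from callers."""

    def __init__(self, cpu_impl, gpu_impl=None):
        self.cpu_impl = cpu_impl
        self.gpu_impl = gpu_impl  # None models a machine with no programmable GPU

    def apply(self, pixels):
        if self.gpu_impl is not None:
            return self.gpu_impl(pixels)  # fast path when hardware allows
        return self.cpu_impl(pixels)      # slower fallback, same result

with_gpu = Filter(invert_on_cpu, invert_on_fake_gpu)
without_gpu = Filter(invert_on_cpu)

# The caller writes the same code either way and gets the same answer.
print(with_gpu.apply([0, 128, 255]) == without_gpu.apply([0, 128, 255]))  # True
```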