
15.7. Rendering with a Rasterization API

Rasterization has been encapsulated in APIs. We’ve seen that although the basic rasterization algorithm is very simple, the process of increasing its performance can rapidly introduce complexity. Very-high-performance rasterizers can be very complex. This complexity leads to a desire to separate out the parts of the rasterizer that we might wish to change between applications while encapsulating the parts that we would like to optimize once, abstract with an API, and then never change again. Of course, it is rare that one truly is willing to never alter an algorithm again, so this means that by building an API for part of the rasterizer we are trading performance and ease of use in some cases for flexibility in others. Hardware rasterizers are an extreme example of an optimized implementation, where flexibility is severely compromised in exchange for very high performance.

There have been several popular rasterization APIs. Today, OpenGL and DirectX are among the most popular hardware APIs for real-time applications. RenderMan is a popular software rasterization API for offline rendering. The space in between, of software rasterizers that run in real time on GPUs, is currently a popular research area with a few open source implementations available [LHLW10, LK11, Pan11].

In contrast, several ray-casting systems and APIs have been built and proposed, but they have yet to reach the level of standardization and acceptance that the rasterization APIs enjoy.

This section describes the OpenGL-DirectX abstraction in general terms. We prefer generalities because the exact entry points for these APIs change on a fairly regular basis. The details of the current versions can be found in their respective manuals. While important for implementation, those details obscure the important ideas.

15.7.1. The Graphics Pipeline

Consider the basic operations of any of our software rasterizer implementations:

  1. (Vertex) Per-vertex transformation to screen space
  2. (Rasterize) Per-triangle (clipping to the near plane and) iteration over pixels, with perspective-correct interpolation
  3. (Pixel) Per-pixel shading
  4. (Output Merge) Merging the output of shading with the current color and depth buffers (e.g., alpha blending)

These are the major stages of a rasterization API, and they form a sequence called the graphics pipeline, which was introduced in Chapter 1. Throughout the rest of this chapter, we refer to software that invokes API entry points as host code and software that is invoked as callbacks by the API as device code. In the context of a hardware-accelerated implementation, such as OpenGL on a GPU, this means that the C++ code running on the CPU is host code and the vertex and pixel shaders executing on the GPU are device code.
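The division into these four stages can be sketched as a toy software pipeline in which the vertex and pixel stages are callbacks supplied by the host while rasterization and output merging are fixed. Everything here (the names, the single-channel grayscale framebuffer, the crude flat depth) is purely illustrative, not any real API:

```cpp
#include <functional>
#include <vector>

struct Vec3 { float x, y, z; };

struct Framebuffer {
    int width, height;
    std::vector<float> color;   // single grayscale channel for brevity
    std::vector<float> depth;
    Framebuffer(int w, int h)
        : width(w), height(h), color(w * h, 0.0f), depth(w * h, 1e30f) {}
};

// Programmable units: callbacks supplied by the host.
using VertexShader = std::function<Vec3(const Vec3&)>;  // to screen space
using PixelShader  = std::function<float(const Vec3&)>; // returns a shade

// Fixed-function helper: signed area test against one triangle edge.
static float edge(const Vec3& a, const Vec3& b, float x, float y) {
    return (x - a.x) * (b.y - a.y) - (y - a.y) * (b.x - a.x);
}

void drawTriangle(Framebuffer& fb, Vec3 v0, Vec3 v1, Vec3 v2,
                  const VertexShader& vs, const PixelShader& ps) {
    // 1. Vertex stage: per-vertex transformation.
    v0 = vs(v0); v1 = vs(v1); v2 = vs(v2);
    for (int y = 0; y < fb.height; ++y) {
        for (int x = 0; x < fb.width; ++x) {
            float px = x + 0.5f, py = y + 0.5f;
            // 2. Rasterize: is the pixel center inside all three edges?
            if (edge(v0, v1, px, py) >= 0 && edge(v1, v2, px, py) >= 0 &&
                edge(v2, v0, px, py) >= 0) {
                Vec3 P = {px, py, (v0.z + v1.z + v2.z) / 3.0f}; // crude depth
                // 3. Pixel stage: per-pixel shading.
                float c = ps(P);
                // 4. Output merge: depth test, then write.
                int i = y * fb.width + x;
                if (P.z < fb.depth[i]) { fb.depth[i] = P.z; fb.color[i] = c; }
            }
        }
    }
}
```

With the vertices wound so that all three edge functions are nonnegative inside, a right triangle covering the lower-left half of an 8×8 buffer shades pixel (1, 1) but leaves (7, 7) untouched.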

15.7.1.1. Rasterizing Stage

Most of the complexity that we would like such an API to abstract is in the rasterizing stage. Under current algorithms, rasterization is most efficient when implemented with only a few parameters, so this stage is usually implemented as a fixed-function unit. In hardware this may literally mean a specific circuit that can only compute rasterization. In software this may simply denote a module that accepts no parameterization.

15.7.1.2. Vertex and Pixel Stages

The per-vertex and per-pixel operations are ones for which a programmer using the API may need to perform a great deal of customization to produce the desired image. For example, an engineering application may require an orthographic projection of each vertex instead of a perspective one. We’ve already changed our per-pixel shading code three times, to support Lambertian, Blinn-Phong, and Blinn-Phong plus shadowing, so clearly customization of that stage is important. The performance impact of allowing nearly unlimited customization of vertex and pixel operations is relatively small compared to the benefits of that customization and the cost of rasterization and output merging. Most APIs enable customization of vertex and pixel stages by accepting callback functions that are executed for each vertex and each pixel. In this case, the stages are called programmable units.

A pipeline implementation with programmable units is sometimes called a programmable pipeline. Beware that in this context, the pipeline order is in fact fixed, and only the units within it are programmable. Truly programmable pipelines in which the order of stages can be altered have been proposed [SFB+09] but are not currently in common use.

For historical reasons, the callback functions are often called shaders or programs. Thus, a pixel shader or “pixel program” is a callback function that will be executed at the per-pixel stage. For triangle rasterization, the pixel stage is often referred to as the fragment stage. A fragment is the portion of a triangle that overlaps the bounds of a pixel. It is a matter of viewpoint whether one is computing the shade of the fragment and sampling that shade at the pixel, or directly computing the shade at the pixel. The distinction only becomes important when computing visibility independently from shading. Multi-sample anti-aliasing (MSAA) is an example of this. Under that rasterization strategy, many visibility samples (with corresponding depth buffer and radiance samples) are computed within each pixel, but a single shade is applied to all the samples that pass the depth and visibility test. In this case, one truly is shading a fragment and not a pixel.

15.7.1.3. Output Merging Stage

The output merging stage is one that we might like to customize as consumers of the API. For example, one might imagine simulating translucent surfaces by blending the current and previous radiance values in the frame buffer. However, the output merger is also a stage that requires synchronization between potentially parallel instances of the pixel shading units, since it writes to a shared frame buffer. As a result, most APIs provide only limited customization at the output merge stage. That allows lockless access to the underlying data structures, since the implementation may explicitly schedule pixel shading to avoid contention at the frame buffer. The limited customization options typically allow the programmer to choose the operator for the depth comparison. They also typically allow a choice of compositing operator for color limited to linear blending, minimum, and maximum operations on the color values.
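In practice, "limited customization" means the API exposes a small set of enumerated operators rather than arbitrary code. The following sketch shows the shape of such an interface; the names and the scalar "color" are illustrative, not from any particular API:

```cpp
// Output-merge state: the programmer picks operators from fixed menus
// instead of supplying a callback, so the implementation can schedule
// framebuffer writes without locks.
enum class DepthTest { Less, Greater, Always };
enum class BlendOp   { Replace, Add, Min, Max, Lerp };

struct OutputMerger {
    DepthTest depthTest   = DepthTest::Less;
    BlendOp   blendOp     = BlendOp::Replace;
    float     blendWeight = 1.0f; // source weight used by Lerp

    // Depth comparison between the incoming and stored depth values.
    bool passesDepth(float srcZ, float dstZ) const {
        switch (depthTest) {
            case DepthTest::Less:    return srcZ < dstZ;
            case DepthTest::Greater: return srcZ > dstZ;
            default:                 return true;
        }
    }

    // Compositing operator for color, limited to linear blending,
    // minimum, and maximum (plus plain replacement).
    float mergeColor(float src, float dst) const {
        switch (blendOp) {
            case BlendOp::Add:  return src + dst;
            case BlendOp::Min:  return src < dst ? src : dst;
            case BlendOp::Max:  return src > dst ? src : dst;
            case BlendOp::Lerp: return blendWeight * src +
                                       (1.0f - blendWeight) * dst;
            default:            return src;
        }
    }
};
```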

There are of course more operations for which one might wish to provide an abstracted interface. These include per-object and per-mesh transformations, tessellation of curved patches into triangles, and per-triangle operations like silhouette detection or surface extrusion. Various APIs offer abstractions of these within a programming model similar to vertex and pixel shaders.

Chapter 38 discusses how GPUs are designed to execute this pipeline efficiently. Also refer to your API manual for a discussion of the additional stages (e.g., tessellate, geometry) that may be available.

15.7.2. Interface

The interface to a software rasterization API can be very simple. Because a software rasterizer uses the same memory space and execution model as the host program, one can pass the scene as a pointer and the callbacks as function pointers or classes with virtual methods. Rather than individual triangles, it is convenient to pass whole meshes to a software rasterizer to decrease the per-triangle overhead.

For a hardware rasterization API, the host machine (i.e., CPU) and graphics device (i.e., GPU) may have separate memory spaces and execution models. In this case, shared memory and function pointers no longer suffice. Hardware rasterization APIs therefore must impose an explicit memory boundary and narrow entry points for negotiating it. (This is also true of the fallback and reference software implementations of those APIs, such as Mesa and DXRefRast.) Such an API requires the following entry points, which are detailed in subsequent subsections.

  1. Allocate device memory.
  2. Copy data between host and device memory.
  3. Free device memory.
  4. Load (and compile) a shading program from source.
  5. Configure the output merger and other fixed-function state.
  6. Bind a shading program and set its arguments.
  7. Launch a draw call, a set of device threads to render a triangle list.

15.7.2.1. Memory Principles

The memory management routines are conceptually straightforward. They correspond to malloc, memcpy, and free, and they are typically applied to large arrays, such as an array of vertex data. They are complicated by the details necessary to achieve high performance for the case where data must be transferred per rendered frame, rather than once per scene. This occurs when streaming geometry for a scene that is too large for the device memory; for example, in a world large enough that the viewer can only ever observe a small fraction at a time. It also occurs when a data stream from another device, such as a camera, is an input to the rendering algorithm. Furthermore, hybrid software-hardware rendering and physics algorithms perform some processing on each of the host and device and must communicate each frame.

One complicating factor for memory transfer is that it is often desirable to adjust the data layout and precision of arrays during the transfer. The data structure for 2D buffers such as images and depth buffers on the host often resembles the “linear,” row-major ordering that we have used in this chapter. On a graphics processor, 2D buffers are often wrapped along Hilbert or Z-shaped (Morton) curves, or at least grouped into small blocks that are themselves row-major (i.e., “block-linear”), to avoid the cache penalty of vertical iteration. The origin of a buffer may differ, and often additional padding is required to ensure that rows have specific memory alignments for wide vector operations and reduced pointer size.
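For the curious, a Morton (Z-order) index is computed by interleaving the bits of x and y, so pixels that are close in 2D stay close in memory in both directions; a row-major index keeps horizontal neighbors adjacent but puts vertical neighbors a full row apart. A minimal sketch:

```cpp
#include <cstdint>

// Morton (Z-order) index: x bits go to even positions, y bits to odd
// positions. Nearby (x, y) pairs map to nearby indices in both axes.
uint32_t mortonIndex(uint16_t x, uint16_t y) {
    uint32_t index = 0;
    for (int bit = 0; bit < 16; ++bit) {
        index |= ((uint32_t)(x >> bit) & 1u) << (2 * bit);     // even bits
        index |= ((uint32_t)(y >> bit) & 1u) << (2 * bit + 1); // odd bits
    }
    return index;
}

// Row-major index for comparison: a one-pixel vertical step jumps a
// whole row of memory.
uint32_t rowMajorIndex(uint16_t x, uint16_t y, uint16_t width) {
    return (uint32_t)y * width + x;
}
```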

Another complicating factor for memory transfer is that one would often like to overlap computation with memory operations to avoid stalling either the host or device. Asynchronous transfers are typically accomplished by semantically mapping device memory into the host address space. Regular host memory operations can then be performed as if both shared a memory space. In this case the programmer must manually synchronize both host and device programs to ensure that data is never read by one while being written by the other. Mapped memory is typically uncached and often has alignment considerations, so the programmer must furthermore be careful to control access patterns.

Note that memory transfers are intended for large data. For small values, such as scalars, 4×4 matrices, and even short arrays, it would be burdensome to explicitly allocate, copy, and free the values. For a shading program with twenty or so arguments, that would incur both runtime and software management overhead. So small values are often passed through a different API associated with shaders.

15.7.2.2. Memory Practice

Listing 15.30 shows part of an implementation of a triangle mesh class. Making rendering calls to transfer individual triangles from the host to the graphics device would be inefficient. So, the API forces us to load a large array of the geometry to the device once when the scene is created, and to encode that geometry as efficiently as possible.

Few programmers write directly to hardware graphics APIs. Those APIs reflect the fact that they are designed by committees and negotiated among vendors. They provide the necessary functionality but do so through awkward interfaces that obscure the underlying function of the calling code. Usage is error-prone because the code operates directly on pointers and uses manually managed memory.

For example, in OpenGL, the code to allocate a device array and bind it to a shader input looks something like Listing 15.29. Most programmers abstract these direct host calls into a vendor-independent, easier-to-use interface.

Listing 15.29: Host code for transferring an array of vertices to the device and binding it to a shader input.

// Allocate memory:
GLuint vbo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, hostVertex.size() * 2 * sizeof(Vector3),
             NULL, GL_STATIC_DRAW);
GLvoid* deviceVertex = 0;
GLvoid* deviceNormal = (GLvoid*)(hostVertex.size() * sizeof(Vector3));

// Copy memory:
glBufferSubData(GL_ARRAY_BUFFER, (GLintptr)deviceVertex,
                hostVertex.size() * sizeof(Point3), &hostVertex[0]);

// Bind the array to a shader input:
int vertexIndex = glGetAttribLocation(shader, "vertex");
glEnableVertexAttribArray(vertexIndex);
glVertexAttribPointer(vertexIndex, 3, GL_FLOAT, GL_FALSE, 0, deviceVertex);

Most programmers wrap the underlying hardware API with their own layer that is easier to use and provides type safety and memory management. This also has the advantage of abstracting the renderer from the specific hardware API. Most console, OS, and mobile device vendors intentionally use equivalent but incompatible hardware rendering APIs. Abstracting the specific hardware API into a generic one makes it easier for a single code base to support multiple platforms, albeit at the cost of one additional level of function invocation.

For Listing 15.30, we wrote to one such platform abstraction instead of directly to a hardware API. In this code, the VertexBuffer class is a managed memory array in device RAM, and AttributeArray and IndexStream are subsets of a VertexBuffer. The “vertex” in the name means that these classes store per-vertex data. It does not mean that they store only vertex positions—for example, the m_normal array is stored in an AttributeArray. This naming convention is a bit confusing, but it is inherited from OpenGL and DirectX. You can either translate this code to the hardware API of your choice, implement the VertexBuffer, AttributeArray, and IndexStream classes yourself, or use a higher-level API such as G3D that provides these abstractions.

Listing 15.30: Host code for an indexed triangle mesh (equivalent to a set of Triangle instances that share a BSDF).

class Mesh {
private:
    AttributeArray     m_vertex;
    AttributeArray     m_normal;
    IndexStream        m_index;

    shared_ptr<BSDF>   m_bsdf;

public:

    Mesh() {}

    Mesh(const std::vector<Point3>& vertex,
        const std::vector<Vector3>& normal,
        const std::vector<int>& index, const shared_ptr<BSDF>& bsdf) : m_bsdf(bsdf) {

        shared_ptr<VertexBuffer> dataBuffer =
          VertexBuffer::create((vertex.size() + normal.size()) *
            sizeof(Vector3) + sizeof(int) * index.size());
        m_vertex = AttributeArray(&vertex[0], vertex.size(), dataBuffer);
        m_normal = AttributeArray(&normal[0], normal.size(), dataBuffer);

        m_index = IndexStream(&index[0], index.size(), dataBuffer);
    }

    ...
};

/** The rendering API pushes us towards a mesh representation
    because it would be inefficient to make per-triangle calls. */
class MeshScene {
public:
    std::vector<Light>    lightArray;
    std::vector<Mesh>     meshArray;
};

Listing 15.31 shows how this code is used to model the triangle-and-ground-plane scene. In it, the process of uploading the geometry to the graphics device is entirely abstracted within the Mesh class.

Listing 15.31: Host code to create indexed triangle meshes for the triangle-plus-ground scene.

void makeTrianglePlusGroundScene(MeshScene& s) {
    std::vector<Vector3> vertex, normal;
    std::vector<int> index;

    // Green triangle geometry
    vertex.push_back(Point3(0, 1, -2));
    vertex.push_back(Point3(-1.9f, -1, -2));
    vertex.push_back(Point3(1.6f, -0.5f, -2));
    normal.push_back(Vector3(0, 0.6f, 1).direction());
    normal.push_back(Vector3(-0.4f, -0.4f, 1.0f).direction());
    normal.push_back(Vector3(0.4f, -0.4f, 1.0f).direction());
    index.push_back(0); index.push_back(1); index.push_back(2);
    index.push_back(0); index.push_back(2); index.push_back(1);
    shared_ptr<BSDF> greenBSDF(new PhongBSDF(Color3::green() * 0.8f,
                                             Color3::white() * 0.2f, 100));

    s.meshArray.push_back(Mesh(vertex, normal, index, greenBSDF));
    vertex.clear(); normal.clear(); index.clear();

    /////////////////////////////////////////////////////////
    // Ground plane geometry
    const float groundY = -1.0f;
    vertex.push_back(Point3(-10, groundY, -10));
    vertex.push_back(Point3(-10, groundY, -0.01f));
    vertex.push_back(Point3(10, groundY, -0.01f));
    vertex.push_back(Point3(10, groundY, -10));

    normal.push_back(Vector3::unitY()); normal.push_back(Vector3::unitY());
    normal.push_back(Vector3::unitY()); normal.push_back(Vector3::unitY());

    index.push_back(0); index.push_back(1); index.push_back(2);
    index.push_back(0); index.push_back(2); index.push_back(3);

    shared_ptr<BSDF> groundBSDF(new PhongBSDF(Color3::white() * 0.8f,
                                              Color3::black(), 1));
    s.meshArray.push_back(Mesh(vertex, normal, index, groundBSDF));

    //////////////////////////////////////////////////////////
    // Light source
    s.lightArray.resize(1);
    s.lightArray[0].position = Vector3(1, 3, 1);
    s.lightArray[0].power = Color3::white() * 31.0f;
}

15.7.2.3. Creating Shaders

The vertex shader must transform the input vertex in global coordinates to a homogeneous point on the image plane. Listing 15.32 implements this transformation. We chose to use the OpenGL Shading Language (GLSL). GLSL is representative of other contemporary shading languages like HLSL, Cg, and RenderMan. All of these are similar to C++. However, there are some minor syntactic differences between GLSL and C++ that we call out here to aid your reading of this example. In GLSL,

  • Arguments that are constant over all triangles are passed as global (“uniform”) variables.
  • Points, vectors, and colors are all stored in vec3 type.
  • const has different semantics (compile-time constant).
  • in, out, and inout are used in place of C++ reference syntax.
  • length, dot, etc. are functions instead of methods on vector classes.

Listing 15.32: Vertex shader for projecting vertices. The output is in homogeneous space before the division operation. This corresponds to the perspectiveProject function from Listing 15.24.

#version 130

// Triangle vertices
in vec3 vertex;
in vec3 normal;

// Camera and screen parameters
uniform float fieldOfViewX;
uniform float zNear;
uniform float zFar;
uniform float width;
uniform float height;

// Position to be interpolated
out vec3 Pinterp;

// Normal to be interpolated
out vec3 ninterp;

vec4 perspectiveProject(in vec3 P) {
    // Compute the side of a square at z = -1 based on our
    // horizontal left-edge-to-right-edge field of view.
    float s = -2.0 * tan(fieldOfViewX * 0.5);
    float aspect = height / width;

    // Project onto z = -1
    vec4 Q;
    Q.x = 2.0 * -P.x / s;
    Q.y = 2.0 * -P.y / (s * aspect);
    Q.z = 1.0;
    Q.w = -P.z;

    return Q;
}

void main() {
    Pinterp = vertex;
    ninterp = normal;

    gl_Position = perspectiveProject(Pinterp);
}

None of these affect the expressiveness or performance of the basic language. The specifics of shading-language syntax change frequently as new versions are released, so don’t focus too much on the details. The point of this example is how the overall form of our original program is preserved but adjusted to the conventions of the hardware API.

Under the OpenGL API, the outputs of a vertex shader are a set of attributes and a vertex of the form (x, y, a, –z). That is, a homogeneous point for which the perspective division has not yet been performed. The value a/–z will be used for the depth test. We choose a = 1 so that the depth test is performed on –1/z, which is a positive value for the negative z locations that will be visible to the camera. We previously saw that any function that provides a consistent depth ordering can be used for the depth test. We mentioned that distance along the eye ray, –z, and –1/z are common choices. Typically one scales the a value such that –a/z is in the range [0, 1] or [–1, 1], but for simplicity we’ll omit that here. See Chapter 13 for the derivation of that transformation.
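The depth-ordering claim is easy to check numerically: with a = 1 the tested value is –1/z, which is positive for visible points and larger for nearer ones, so a GREATER comparison keeps the nearest surface. A minimal sketch:

```cpp
// With a = 1, the value interpolated and tested per pixel is -1/z.
// z is negative in camera space for points in front of the camera.
float depthValue(float z) {
    return -1.0f / z;
}

// Under this mapping, nearer points have LARGER depth values, so the
// depth buffer is cleared to 0 and compared with GREATER.
bool passesGreaterTest(float zNewer, float zStored) {
    return depthValue(zNewer) > depthValue(zStored);
}
```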

Note that we did not scale the output vertex to the dimensions of the image, negate the y-axis, or translate the origin to the upper left in screen space, as we did for the software renderer. That is because by convention, OpenGL considers the upper-left corner of the screen to be at (–1, 1) and the lower-right corner at (1, –1).

We choose the 3D position of the vertex and its normal as our attributes. The hardware rasterizer will automatically interpolate these across the surface of the triangle in a perspective-correct manner. We need to treat the vertex as an attribute because OpenGL does not expose the 3D coordinates of the point being shaded.
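What "perspective-correct" means can be shown in a few lines: the rasterizer linearly interpolates each attribute divided by w, along with 1/w itself, and undoes the division per pixel. A one-dimensional sketch (hypothetical names, not the actual hardware implementation):

```cpp
// One channel of an attribute and its homogeneous w (-z here), as
// output by the vertex stage.
struct VertexOut {
    float attr;
    float w;
};

// t in [0, 1] is the LINEAR interpolation parameter in screen space.
// Interpolating attr/w and 1/w linearly, then dividing, recovers the
// perspective-correct attribute value.
float perspectiveCorrectLerp(VertexOut v0, VertexOut v1, float t) {
    float attrOverW = (1.0f - t) * (v0.attr / v0.w) + t * (v1.attr / v1.w);
    float oneOverW  = (1.0f - t) * (1.0f / v0.w)    + t * (1.0f / v1.w);
    return attrOverW / oneOverW;
}
```

At the screen-space midpoint between a vertex with attr = 0, w = 1 and one with attr = 10, w = 3, the correct value is 2.5, not the naive screen-space average of 5, because the farther (larger-w) vertex is foreshortened.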

Listings 15.33 and 15.34 give the pixel shader code for the shade routine, which corresponds to the shade function from Listing 15.17, and helper functions that correspond to the visible and BSDF::evaluateFiniteScatteringDensity routines from the ray tracer and software rasterizer. The interpolated attributes enter the shader as the global variables Pinterp and ninterp. We then perform shading in exactly the same manner as for the software renderers.

Listing 15.33: Pixel shader for computing the radiance scattered toward the camera from one triangle illuminated by one light.

#version 130
// BSDF
uniform vec3    lambertian;
uniform vec3    glossy;
uniform float   glossySharpness;

// Light
uniform vec3    lightPosition;
uniform vec3    lightPower;

// Pre-rendered depth map from the light’s position
uniform sampler2DShadow shadowMap;

// Point being shaded. OpenGL has automatically performed
// homogeneous division and perspective-correct interpolation for us.
in vec3                   Pinterp;
in vec3                   ninterp;

// Value we are computing
out vec3                  radiance;

// Normalize the interpolated normal; OpenGL does not automatically
// renormalize for us.
vec3 n = normalize(ninterp);

vec3 shade(const in vec3 P, const in vec3 n) {
    vec3 radiance         = vec3(0.0);

    // Assume only one light
    vec3 offset           = lightPosition - P;
    float distanceToLight = length(offset);
    vec3 w_i              = offset / distanceToLight;
    vec3 w_o              = -normalize(P);

    if (visible(P, w_i, distanceToLight, shadowMap)) {
        vec3 L_i = lightPower / (4.0 * PI * distanceToLight * distanceToLight);

        // Scatter the light.
        radiance +=
            L_i *
            evaluateFiniteScatteringDensity(w_i, w_o) *
            max(0.0, dot(w_i, n));
    }

    return radiance;
}

void main() {
    vec3 P = Pinterp;

    radiance = shade(P, n);
}

Listing 15.34: Helper functions for the pixel shader.

#define PI 3.1415927

bool visible(const in vec3 P, const in vec3 w_i, const in float distanceToLight,
             sampler2DShadow shadowMap) {
    return true;
}

/** Returns f(wi, wo). Same as BSDF::evaluateFiniteScatteringDensity
    from the ray tracer. */
vec3 evaluateFiniteScatteringDensity(const in vec3 w_i, const in vec3 w_o) {
    vec3 w_h = normalize(w_i + w_o);

    return (lambertian +
            glossy * ((glossySharpness + 8.0) *
                      pow(max(0.0, dot(w_h, n)), glossySharpness) / 8.0)) / PI;
}

However, there is one exception. The software renderers iterated over all the lights in the scene for each point to be shaded. The pixel shader is hardcoded to accept a single light source. That is because processing a variable number of arguments is challenging at the hardware level. For performance, the inputs to shaders are typically passed through registers, not heap memory. Register allocation is generally a major factor in optimization. Therefore, most shading compilers require the number of registers consumed to be known at compile time, which precludes passing variable length arrays. Programmers have developed three forward-rendering design patterns for working within this limitation. These use a single framebuffer and thus limit the total space required by the algorithm. A fourth and currently popular deferred-rendering method requires additional space.

  1. Multipass Rendering: Make one pass per light over all geometry, summing the individual results. This works because light combines by superposition. However, one has to be careful to resolve visibility correctly on the first pass and then never alter the depth buffer. This is the simplest and most elegant solution. It is also the slowest because the overhead of launching a pixel shader may be significant, so launching it multiple times to shade the same point is inefficient.
  2. Übershader: Bound the total number of lights, write a shader for that maximum number, and set the unused lights to have zero power. This is one of the most common solutions. If the overhead of launching the pixel shader is high and there is significant work involved in reading the BSDF parameters, the added cost of including a few unused lights may be low. This is a fairly straightforward modification to the base shader and is a good compromise between performance and code clarity.
  3. Code Generation: Generate a set of shading programs, one for each number of lights. These are typically produced by writing another program that automatically generates the shader code. Load all of these shaders at runtime and bind whichever one matches the number of lights affecting a particular object. This achieves high performance if the shader only needs to be swapped a few times per frame, and is potentially the fastest method. However, it requires significant infrastructure for managing both the source code and the compiled versions of all the shaders, and may actually be slower than the conservative solution if changing shaders is an expensive operation.

    If there are different BSDF terms for different surfaces, then we have to deal with all the permutations of the number of lights and the BSDF variations. We again choose between the above three options. This combinatorial explosion is one of the primary drawbacks of current shading languages, and it arises directly from the requirement that the shading compiler produce efficient code. It is not hard to design more flexible languages and to write compilers for them. But our motivation for moving to a hardware API was largely to achieve increased performance, so we are unlikely to accept a more general shading language if it significantly degrades performance.

  4. Deferred Lighting: A deferred approach that addresses these problems but requires more memory is to separate the computation of which point will color each pixel from illumination computation. An initial rendering pass renders many parallel buffers that encode the shading coefficients, surface normal, and location of each point (often, assuming an übershader). Subsequent passes then iterate over the screen-space area conservatively affected by each light, computing and summing illumination. Two common structures for those lighting passes are multiple lights applied to large screen-space tiles and ellipsoids for individual lights that cover the volume within which their contribution is non-negligible.
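The superposition argument behind the multipass and übershader patterns can be checked with a toy scalar, Lambertian-only model. All names here (shadeAllLights, multipass, ubershader) are hypothetical, chosen only to mirror the patterns above:

```cpp
#include <cstddef>
#include <vector>

// Scalar "radiance" and a Lambertian-only light model, just to exercise
// the structure of each pattern.
struct Light { float irradiance; }; // zero means "unused slot"

float shadeOneLight(float albedo, const Light& L) {
    return albedo * L.irradiance;
}

// Reference: a shader that could take any number of lights (what the
// hardware register model disallows).
float shadeAllLights(float albedo, const std::vector<Light>& lights) {
    float r = 0.0f;
    for (const Light& L : lights) r += shadeOneLight(albedo, L);
    return r;
}

// Pattern 1, multipass: one pass per light, additively blended into the
// framebuffer. Correct because light combines by superposition.
float multipass(float albedo, const std::vector<Light>& lights) {
    float framebuffer = 0.0f;
    for (const Light& L : lights) framebuffer += shadeOneLight(albedo, L);
    return framebuffer;
}

// Pattern 2, übershader: a fixed maximum number of lights, with unused
// slots set to zero power so they contribute nothing.
const int MAX_LIGHTS = 4;
float ubershader(float albedo, const std::vector<Light>& lights) {
    Light padded[MAX_LIGHTS] = {};
    for (std::size_t i = 0; i < lights.size() && i < (std::size_t)MAX_LIGHTS; ++i)
        padded[i] = lights[i];
    float r = 0.0f;
    for (int i = 0; i < MAX_LIGHTS; ++i) r += shadeOneLight(albedo, padded[i]);
    return r;
}
```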

For the single-light case, moving from our own software rasterizer to a hardware API did not change our perspectiveProject and shade functions substantially.

However, our shade function was not particularly powerful. Although we did not choose to do so, in our software rasterizer, we could have executed arbitrary code inside the shade function. For example, we could have written to locations other than the current pixel in the frame buffer, or cast rays for shadows or reflections. Such operations are typically disallowed in a hardware API. That is because they interfere with the implementation’s ability to efficiently schedule parallel instances of the shading programs in the absence of explicit (inefficient) memory locks.

This leaves us with two choices when designing an algorithm with more significant processing, especially at the pixel level. The first choice is to build a hybrid renderer that performs some of the processing on a more general processor, such as the host, or perhaps on a general computation API (e.g., CUDA, Direct Compute, OpenCL, OpenGL Compute). Hybrid renderers typically incur the cost of additional memory operations and the associated synchronization complexity.

The second choice is to frame the algorithm purely in terms of rasterization operations, and make multiple rasterization passes. For example, we can’t conveniently cast shadow rays in most hardware rendering APIs today. But we can sample from a previously rendered shadow map.

Similar methods exist for implementing reflection, refraction, and indirect illumination purely in terms of rasterization. These avoid much of the performance overhead of hybrid rendering and leverage the high performance of hardware rasterization. However, they may not be the most natural way of expressing an algorithm, and that may lead to a net inefficiency and certainly to additional software complexity. Recall that changing the order of iteration from ray casting to rasterization increased the space demands of rendering by requiring a depth buffer to store intermediate results. In general, converting an arbitrary algorithm to a rasterization-based one often has this effect. The space demands might grow larger than is practical in cases where those intermediate results are themselves large.

Shading languages are almost always compiled into executable code at runtime, inside the API. That is because even within products from one vendor the underlying micro-architecture may vary significantly. This creates a tension within the compiler between optimizing the target code and producing the executable quickly. Most implementations err on the side of optimization, since shaders are often loaded once per scene. Beware that if you synthesize or stream shaders throughout the rendering process there may be substantial overhead.

Some languages (e.g., HLSL and CUDA) offer an initial compilation step to an intermediate representation. This eliminates the runtime cost of parsing and some trivial compilation operations while maintaining flexibility to optimize for a specific device. It also allows software developers to distribute their graphics applications without revealing the shading programs to the end-user in a human-readable form on the file system. For closed systems with fixed specifications, such as game consoles, it is possible to compile shading programs down to true machine code. That is because on those systems the exact runtime device is known at host-program compile time. However, doing so would reveal some details of the proprietary micro-architecture, so even in this case vendors do not always choose to have their APIs perform a complete compilation step.

15.7.2.4. Executing Draw Calls

To invoke the shaders we issue draw calls. These occur on the host side. One typically clears the framebuffer, and then, for each mesh, performs the following operations.

  1. Set fixed function state.
  2. Bind a shader.
  3. Set shader arguments.
  4. Issue the draw call.

These are followed by a call to send the framebuffer to the display, which is often called a buffer swap. An abstracted implementation of this process might look like Listing 15.35. This is called from a main rendering loop, such as Listing 15.36.

Listing 15.35: Host code to set fixed-function state and shader arguments, and to launch a draw call under an abstracted hardware API.

void loopBody(RenderDevice* gpu) {
    gpu->setColorClearValue(Color3::cyan() * 0.1f);
    gpu->clear();

    const Light& light = scene.lightArray[0];

    for (unsigned int m = 0; m < scene.meshArray.size(); ++m) {
        Args args;
        const Mesh& mesh = scene.meshArray[m];
        const shared_ptr<BSDF>& bsdf = mesh.bsdf();

        args.setUniform("fieldOfViewX",      camera.fieldOfViewX);
        args.setUniform("zNear",             camera.zNear);
        args.setUniform("zFar",              camera.zFar);

        args.setUniform("lambertian",        bsdf->lambertian);
        args.setUniform("glossy",            bsdf->glossy);
        args.setUniform("glossySharpness",   bsdf->glossySharpness);

        args.setUniform("lightPosition",     light.position);
        args.setUniform("lightPower",        light.power);

        args.setUniform("shadowMap",         shadowMap);

        args.setUniform("width",             gpu->width());
        args.setUniform("height",            gpu->height());

        gpu->setShader(shader);

        mesh.sendGeometry(gpu, args);
    }
    gpu->swapBuffers();
}

Listing 15.36: Host code to set up the main hardware rendering loop.

OSWindow::Settings osWindowSettings;
RenderDevice* gpu = new RenderDevice();
gpu->init(osWindowSettings);

// Load the vertex and pixel programs
shader = Shader::fromFiles("project.vrt", "shade.pix");

shadowMap = Texture::createEmpty("Shadow map", 1024, 1024,
    ImageFormat::DEPTH24(), Texture::DIM_2D_NPOT, Texture::Settings::shadow());
makeTrianglePlusGroundScene(scene);

// The depth test will run directly on the interpolated value in
// Q.z/Q.w, which is going to be smallest at the far plane
gpu->setDepthTest(RenderDevice::DEPTH_GREATER);
gpu->setDepthClearValue(0.0);

while (! done) {
    loopBody(gpu);
    processUserInput();
}

...