Computer graphics has often been called "embarrassingly parallel." Large batches of vertices need to be transformed with identical or similar matrices. Large blocks of pixels need to be rendered with identical or similar shaders. Large images need to be blended to create a final image. Hence, the natural evolution of the GPU architecture has been toward a multithreaded single instruction, multiple data (SIMD) machine with parallel execution units in three areas: the vertex, pixel, and raster portions of the chip. The architecture has also evolved to handle massive amounts of data on the GPU. Today, onboard memory is topping out at half a gigabyte, with additional fast access to main memory through PCI Express (theoretically at 4 GB/s in each direction). Internal memory bandwidth—between the GPU and its own local video memory—is around 40 GB/s.
As you can imagine, the GPU is particularly well suited to certain problem domains. Some problem sets lend themselves to the current GPU architecture, and future GPUs can reasonably be expected to become more flexible and handle problem sets that are impractical today. In other words, the GPU is often the best tool for the job and, due to rapid increases in brute computational power (see the trend line in Figure 1), is poised to capture more and more of the high-performance computing (HPC) market. The applications seeing the greatest performance increases are those with high computational, or arithmetic, intensity: many math operations occur for each piece of data that is read or written. This is because most of a GPU's transistors are devoted to computational resources, and because memory accesses are significantly more expensive than computation. For the same reason, the GPU excels when off-chip communication is reduced or eliminated.
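To make the notion of arithmetic intensity concrete, the following sketch estimates math operations per byte of memory traffic for two common kernels. The operation and traffic counts are standard back-of-the-envelope figures, not measurements of any particular GPU; the function names are illustrative.

```python
# Arithmetic intensity = math operations per byte of data moved.
# Assumes 4-byte (32-bit) floating-point values throughout.

def saxpy_intensity(n):
    """y = a*x + y over n elements: 2 operations per element;
    reads x[i] and y[i], writes y[i] -> 12 bytes of traffic per element."""
    flops = 2 * n
    bytes_moved = 12 * n
    return flops / bytes_moved

def matmul_intensity(n):
    """Naive n x n matrix multiply: 2*n^3 operations; with ideal reuse,
    only the 3*n^2 matrix entries move -> 12*n^2 bytes of traffic."""
    flops = 2 * n**3
    bytes_moved = 12 * n**2
    return flops / bytes_moved

print(saxpy_intensity(1024))   # ~0.17 ops/byte: memory-bound, a poor GPU fit
print(matmul_intensity(1024))  # ~170 ops/byte: compute-bound, a strong GPU fit
```

The contrast shows why problems rich in reusable computation, rather than raw data movement, see the largest speedups.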
Figure 1 Rapidly increasing GPU capabilities.
Over the past five years, the GPU has rapidly evolved into a programmable "stream processor." Conceptually, all the data it processes can be considered a stream: an ordered set of elements of the same data type. Streams can range from simple arrays of integers to arrays of 4 × 4 32-bit floating-point matrices, or even arrays of arbitrary user-defined structures. Streams are fed through a kernel, which applies the same function to every element of the stream. Kernels can operate on the data in multiple ways:
- Transformation. Remapping each input element to an output element.
- Expansion. Creating multiple outputs per input element.
- Reduction. Creating one output from multiple inputs.
- Filter. Outputting a subset of the input elements.
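The four kernel patterns above can be sketched on the CPU, modeling a stream as a Python list. The function names here are illustrative, not any real GPU API:

```python
# A minimal CPU model of the four stream-kernel operations.

def transform(stream, fn):
    """Transformation: remap each element independently."""
    return [fn(x) for x in stream]

def expand(stream, fn):
    """Expansion: fn yields multiple outputs per input element."""
    return [y for x in stream for y in fn(x)]

def reduce_stream(stream, fn, init):
    """Reduction: combine all inputs into a single output."""
    acc = init
    for x in stream:
        acc = fn(acc, x)
    return acc

def filter_stream(stream, keep):
    """Filter: output only the subset of elements that pass `keep`."""
    return [x for x in stream if keep(x)]

data = [1, 2, 3, 4]
print(transform(data, lambda x: x * x))            # [1, 4, 9, 16]
print(expand(data, lambda x: (x, -x)))             # [1, -1, 2, -2, 3, -3, 4, -4]
print(reduce_stream(data, lambda a, b: a + b, 0))  # 10
print(filter_stream(data, lambda x: x % 2 == 0))   # [2, 4]
```

On a real GPU each pattern maps to hardware parallelism; the sketch only captures the data-flow semantics.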
One particular restriction to flexibility is key to enabling parallel execution and high-performance computation: kernel outputs are functions only of their kernel inputs, and within a kernel, computations on one stream element are never dependent on computations on another element.
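This independence restriction is what makes parallelism safe: because each output depends only on its own input element, the stream can be partitioned across any number of workers and every execution order yields the same result. A small sketch (using Python threads purely as a stand-in for GPU execution units):

```python
# Illustrative only: because `kernel` reads nothing but its own input
# element, splitting the stream across concurrent workers cannot change
# the result.

from concurrent.futures import ThreadPoolExecutor

def kernel(x):
    return 3 * x + 1   # depends only on this element's input

def run_parallel(stream, workers=4):
    # Executor.map preserves input order in its results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(kernel, stream))

data = list(range(8))
assert run_parallel(data) == [kernel(x) for x in data]
```

A kernel that read a neighboring element's *output* (for example, a running sum written in place) would violate the restriction, and its result would depend on execution order.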