Over the last decade, we've seen a subtle change in how mainstream processors are designed. Earlier chips were created for specific languages. The Berkeley RISC, for example, was designed to run C code, which contained a lot more structure than typical assembly programs of the era. Lisp machines, obviously, were designed for Lisp.
Additions to the instruction sets were usually made to make implementing certain language features easier. CALL instructions simplified implementing procedural languages. Floating-point operations made it trivial to support the floating-point data types that most languages included.
This was quite an easy way to use up transistors early on. More recent chips have started moving in a slightly different direction, providing features for specific algorithms rather than for specific languages. Vector units were among the first extensions to be created in this way. They were added specifically to make it easier to run certain multimedia applications.
For the first time since the 1980s, compiler writers were no longer the main target market for new instruction-set additions; instead, library authors were targeted. You can see a concrete example with SSE. Modern x86 compilers use the SSE instruction set rather than the x87 coprocessor for floating-point operations, because SSE is easier to target. With SSE, you can (usually) run an operation on four 32-bit or two 64-bit values for about the same cost as running it on one floating-point operand.
Unfortunately, typical compiled code only gets one-quarter to one-half the theoretical maximum floating-point throughput of the CPU. Various autovectorization techniques attempt to group operations so that they can be run in parallel, but this is quite difficult, and compilers aren't very good at it. Making efficient use of these units requires some programmer effort. You can't just take an arbitrary C program and run all of the floating-point operations on a vector unit, but you can extend C to allow vector types.
It's not unusual for a modern computer to contain a coprocessor with a higher transistor count than the main CPU has. The name of this coprocessor, the graphics processing unit (GPU), is a significant clue that it was designed for running specific algorithms, rather than for specific languages.
Indeed, several languages have been designed specifically to run on these coprocessors: Cg, GLSL, HLSL, and so on. In some ways, this situation represents a step backward in computing. The promise of high-level languages was that they would allow you to ignore the details of the underlying architecture. Unfortunately, this promise wasn't entirely fulfilled. Languages like C, which barely qualifies as "high-level" these days, expose too much detail about certain types of architecture, while others hide the system so well that even simple things become expensive.
The Language Barrier
When I first learned it, C was regarded as a high-level language because it was portable. In contrast, PL/M, the language that I learned before C, was a low-level language because it exposed CPU-specific features such as memory segmentation directly to the programmer. These days, C is more commonly regarded as a low-level language.
What does this semantic difference really mean? In rough terms, in a low-level language, the language constructs map simply to instruction sequences. An optimizing C compiler will reorder instructions, perform substitutions, and so on, but the basic language features can each be implemented in just a few instructions on a typical architecture. C only supports operations on primitive types, which typically translate to a single instruction, possibly bracketed by a load and a store. Flow control in C is implemented trivially in terms of jump and call instructions.
This is why C is generally considered to be fast: not because anything about the language makes it especially efficient. Quite the reverse, in fact. If a language doesn't give the programmer control over memory layout, it gives the compiler a lot more opportunities to make use of vector instructions, but exploiting those opportunities requires a lot of cleverness on the part of the compiler. C, in contrast, runs quickly with a stupid compiler.
You can see this principle demonstrated in the case of the Portable C Compiler. In spite of performing very few optimizations, it still produces code that's reasonably fast. It's slower than the output from a better C compiler, but still a lot better than a naïve implementation of a language like Smalltalk or Haskell.
Speed isn't always the result, however. If you take a typical C program and compile it for a modern GPU, you'll see some problems. A modern CPU is designed around branch-heavy programs: typical profiling tests show that most code written in C-like languages branches once every seven instructions, on average. A lot of effort is spent on making branch prediction work well in these processors, because when the CPU can't predict the target of a branch, it has to wait for all of the instructions already in the pipeline to finish before it can begin executing from the branch target.
By contrast, GPUs don't bother with branch prediction. They expect to run code that branches only once in every few hundred instructions, and they spend their transistors instead on things like implementing trigonometric operations as single instructions. They also expect to run the same program fragment on large amounts of data. In a typical 3D program, they run the same vertex shader on every point in a scene, the same pixel shader on every textured pixel, and so on. NVIDIA's current architecture runs threads in groups of four, entirely in parallel while they're executing the same instruction stream, but with a big performance penalty if they fall out of sync (for instance, if two threads take different branches). Finally, they run a lot of these thread groups in parallel.
C isn't a low-level language on such an architecture. It doesn't contain vector types, it doesn't expose any of the more complex operations that are available, and it doesn't include any support for concurrency. In this case, C is considered a high-level language because it presents an abstract model that doesn't closely match the underlying hardware.