On a lot of older systems, arguments were always passed on the stack. If you wanted to pass a pointer to a subroutine, you had to push it onto the stack, make the call, and then pop it off again. This technique was quite fast, but it used a bit of stack space for the copies, so a lot of people started optimizing.
The common optimization was to store a pointer to data used by a set of subroutines in a global variable. Every call then needed only a single load to get at the value. This approach was fairly sensible at the time, but it has a number of problems on a modern system:
- The first problem is that this optimization was originally used in position-dependent code; that is, not in dynamic libraries. Therefore every global variable was allocated a static address. When you loaded a value from a global, you were performing a single load instruction with a constant operand, which was very fast. On modern systems, and in shared libraries where I've seen people try to use this technique, there is typically a hidden layer of indirection involved. Global variables in position-independent code are accessed by loading the base address of the library and then adding an offset. Often the base address is found by loading a value at a non-fixed offset from a value in a register, meaning that you now require several load operations to get at this value, which is more than you need to read the value from the stack.
- The second problem with this technique is that arguments are now very rarely passed in memory. On a modern RISC machine (and even on something like x86-64), the arguments are likely to be passed in registers. On something like x86, the top few stack slots are likely to be aliased with hidden registers, so accessing them is almost as fast as accessing a register. Therefore, the simple approach of just passing the pointer as an argument often requires no memory accesses, while the "optimized" version can require three or four. I've seen a 10-20% speed improvement from undoing this optimization in some code. As a nice side effect, removing the reference to a global variable made the code reentrant, so it was trivial to split it over several threads and use the power of a modern multicore CPU.
The slowdown caused by this "optimization" comes from the way in which memory hierarchies have evolved. Currently, cache misses are very expensive, whereas on older machines all accesses to variables in memory were at roughly the same speed. The triple indirection means that you need at least three cache lines to be hot in order to access the global variable quickly. Cache works best when values that are accessed at the same time are near each other in memory, but this optimization has the opposite effect. It can increase cache churn, the repeated loading and unloading of bits of memory into and out of the cache, which can have a serious negative impact on performance.
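The chain of loads can be made explicit with a small model. This is not how any real linker lays out a GOT; `got_slot` is just a stand-in for the table entry, so that each dereference in `read_via_indirection` corresponds to one of the potential cache-line touches, while the argument-passing version needs only the final load.

```c
#include <assert.h>

typedef struct { int value; } payload;

static payload data = { 42 };
static payload *global_ptr = &data;       /* the program's global variable */
static payload **got_slot = &global_ptr;  /* stand-in for the GOT entry    */

int read_via_indirection(void) {
    /* In real position-independent code, even locating got_slot costs a
     * load from a register-relative address. Each dereference below can
     * touch a different cache line. */
    payload *p = *got_slot;   /* load: fetch the global's current value */
    return p->value;          /* load: fetch the data it points at      */
}

int read_via_argument(const payload *p) {
    return p->value;          /* one load: p arrives in a register */
}
```

If `data`, the global, and the GOT entry each live on separate cache lines, the indirect path keeps all three lines live, which is the churn described above.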