Optimization Options

Code optimization is an attempt to improve performance. The trade-off is lengthened compile times, increased memory usage during compilation, and, in some cases, a larger disk footprint for the resulting binary. While some optimizations are general in nature and can be applied in any circumstance, other optimizations are designed to exploit features of a given CPU or CPU family. This section looks at both classes of optimization options.

The bare -O option tells GCC to reduce both code size and execution time; it is equivalent to -O1. The exact optimizations performed at this level depend on the target processor, but they always include at least jump threading and deferred stack pops. Jump threading rewires branches so that a jump whose destination is itself a conditional jump with a predictable outcome goes directly to the final target, reducing the number of jumps executed; deferred stack pops occur when the compiler lets function arguments accumulate on the stack across several calls and then pops them all at once, rather than popping the arguments piecemeal as each called function returns.

-O2 level optimizations include all first-level optimizations plus additional tweaks that involve processor instruction scheduling. At this level, the compiler takes care to make sure the processor has instructions to execute while it waits for the results of other instructions or for data to arrive from second-level cache or main memory. How these optimizations are implemented is highly processor-specific. -O3 includes all of the -O2 optimizations plus still more aggressive ones, such as automatic inlining of simple functions.

Depending on the amount of low-level knowledge you have about a given CPU family, you can use the -f{flag} option to request specific optimizations you want performed. Table 3.4 lists eight -f optimization flags that are often useful.

Table 3.4 GCC Optimization Flags

-ffloat-store
Suppresses storing the value of floating-point variables in CPU registers. This frees CPU registers for other uses and prevents the excess precision that can arise when floating-point values are kept in registers.

-ffast-math
Generates floating-point math optimizations that are faster but violate IEEE and/or ANSI/ISO standards. If your program does not need strict IEEE conformance, consider using this flag when compiling floating-point-intensive programs.

-finline-functions
Expands simple functions in place inside their callers. The compiler decides what constitutes a simple function. Reducing the processor overhead associated with function calls is a basic optimization technique.

-funroll-loops
Unrolls all loops whose iteration count is fixed and can be determined at compile time. Unrolling saves several CPU instructions per loop iteration, which can significantly decrease execution time at the cost of larger code.

-fomit-frame-pointer
Discards the frame pointer in functions that do not need one. This speeds up processing because the instructions necessary to set up, save, and restore frame pointers are eliminated, and it frees a CPU register for other uses.

-fschedule-insns
Reorders instructions to reduce stalls that occur when an instruction must wait for data that is not yet available in the CPU.

-fschedule-insns2
Performs a second round of instruction reordering after register allocation (similar to -fschedule-insns).

-fmove-all-movables
Moves all invariant calculations occurring inside a loop outside of the loop. This eliminates unnecessary operations from each iteration, speeding up the loop overall.

Inlining and loop unrolling can greatly improve a program's execution speed because they eliminate the overhead of function calls and loop-control bookkeeping, but the usual cost is a large increase in the size of the object or binary files. You will need to experiment to see whether any gain in execution speed is worth the increased file size. In general, when working with compiler options of the sort listed in Table 3.4, experimentation and code profiling, or performance analysis, are necessary to confirm that a given optimization has the desired effect.

As an experiment, the following program, pisqrt.c (see Listing 3.7), calculates the square root of pi 10,000,000 times. Table 3.5 lists the optimization or processor flag used to compile the program and the average execution time of ten runs of pisqrt on a 266MHz Pentium II CPU with 128MB RAM.

Listing 3.7 Calculate the Square Root of pi

/*
 * pisqrt.c - Calculate the square root of PI 10,000,000
 * times.
 */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double pi = M_PI; /* Defined in <math.h> */
    double pisqrt;
    long i;

    for (i = 0; i < 10000000; ++i) {
        pisqrt = sqrt(pi);
    }

    return 0;
}

Table 3.5 pisqrt Execution Times

Flag            Average Execution Time
(none)          5.43 seconds
-O1             2.74 seconds
-O2             2.83 seconds
-O3             2.76 seconds
                5.41 seconds
                5.46 seconds
                5.44 seconds
                5.45 seconds
                5.44 seconds
This not-terribly-rigorous experiment shows that, at least for this program, letting the compiler choose the set of optimizations with -O1, -O2, or -O3 results in the greatest performance gains. The lesson to take away from this demonstration is that unless you know a great deal about processor architecture or know that a particular optimization will have a specific effect your program needs, stick with the -O optimization options.

Tip - In general, Linux programmers seem to use -O2 optimization. Even on small programs, like the hello.c program introduced at the beginning of this chapter, you will see small reductions in code size and execution time. This is based more on habit than on empirical testing, however. As Table 3.5 shows, -O1 optimization resulted in the best performance for the program tested. The moral? Try different optimization levels to see which one gives the best results!
