Acovea: Analysis of Compiler Options via Evolutionary Algorithm

Using Natural Selection to Investigate Software Complexities


by Scott Robert Ladd
16 December 2003 (obsolete)
 

This article has been superseded by

Results & Analysis: Acovea 4.0.0 on Pentium 4 and Opteron
 

On 30 March 2004, I published a revised version of Acovea, along with new results based on testing with Pentium 4 and Opteron systems running Linux. The information below is still relevant, if a bit dated.

 


Related Articles
Acovea Overview
Analysis: Acovea 4.0.0 on Pentium 4 and Opteron
Acovea FAQ
A Description of the Genetic Algorithm

 Table 1: Test System "Tycho"
 Intel D850EMV2 motherboard
 Pentium 4 2.8GHz, Uniprocessor w/HT
 512MB (PC800) RAM,  533MHz FSB
 80GB ATA/100 HD
 ATI Radeon 9000 Pro 64MB video
 Debian "sid" (unstable)
 linux v2.6.0-test11 SMP
 glibc 2.3.2, binutils 2.13.1
 GCC 3.4 20021029 (custom compiled)

I ran these tests on Tycho, an Intel Pentium 4-based system detailed in Table 1. All compiles included the -march=pentium4 option, which implies -mcpu=pentium4, -msse, -msse2, and -mmmx. Astute readers will note that Tycho runs a "test" version of the Linux kernel in conjunction with the "unstable" release of Debian GNU/Linux. I ran all tests from the bash prompt, in text mode, with no X server and a minimum of daemons running.

A disclaimer: Every time I publish a review, several people e-mail to complain that I didn't study Athlon, Opteron, or PowerPC processors. I can't study systems I do not have; if you have access to such systems, please feel free to download the code and run the tests yourself. I may run tests on my UltraSPARC system, but I'm not entirely certain such tests have any value. I have performed testing on Pentium 3 systems, and I'll likely update this article if and when new hardware comes through my front door.

All tests were performed on a recent snapshot of GCC 3.4. GCC 3.3 is the current, stable product; any discoveries I might make will not result in changes to an already-stable compiler. GCC 3.4 is the next major release, and it is still undergoing fine tuning. If there is interest, I will create versions of gccacovea that can test older versions of the compiler, such as 2.9x (still popular in many distributions) and 3.3.

Various settings change the types of options tested. For example, invoking gccacovea with the -O3 switch implies -O3 (and all its constituent options), whereas the default is to imply -O1 and evolve all remaining options (including those in -O3). In default mode, gccacovea searches for the best set from among all the options it tests; this is good for seeing if the program can evolve a set of options that produces faster code than -O3. When using -O3 as a base, the program only performs evolution on options not included in -O3 -- thus testing the Pentium 4-specific options, along with new and experimental options.

The other command-line options set the characteristics of the genetic algorithm, such as population size or migration rate. I tested a variety of settings in the GA, settling on what are now the default settings for the gccacovea program:

  # of populations: 5
   population size: 40
     survival rate: 10% (4)
    migration rate: 5% (2)
     mutation rate: 1%
    crossover rate: 100%
   fitness scaling: windowed
generations to run: 20
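The interplay of these settings can be illustrated with a toy genetic algorithm. This is a deliberately simplified sketch, not Acovea's actual implementation: a single population instead of five, no migration or windowed fitness scaling, and a made-up fitness function (counting enabled flags) standing in for measured benchmark run time.

```python
import random

N_FLAGS = 20     # hypothetical number of on/off compiler flags
POP_SIZE = 40    # population size
SURVIVORS = 4    # 10% survival rate
MUTATION = 0.01  # 1% per-gene mutation rate
GENERATIONS = 20

def fitness(genome):
    # Toy stand-in: reward enabled flags. In Acovea, fitness is the
    # measured run time of the benchmark compiled with those flags.
    return sum(genome)

def evolve(seed=1):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(N_FLAGS)]
           for _ in range(POP_SIZE)]
    first_best = max(fitness(g) for g in pop)
    for _ in range(GENERATIONS):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:SURVIVORS]            # top 10% survive to breed
        children = []
        while len(children) < POP_SIZE - SURVIVORS:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, N_FLAGS)  # 100% crossover rate
            child = [g ^ (rng.random() < MUTATION)  # 1% mutation
                     for g in a[:cut] + b[cut:]]
            children.append(child)
        pop = parents + children
    return first_best, max(fitness(g) for g in pop)
```

Because the survivors carry over unchanged (elitism), the best genome can never get worse from one generation to the next, so the final best fitness is always at least as good as the initial best.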

Results

For the purpose of this article, I evolved the best set of options for five benchmark programs. Each benchmark is accompanied by a chart showing how various optimization options fared against each other and an unoptimized compile.

  • none - no optimizations
  • O1, O2, O3 - baseline compiles with those collective options defined by GCC.
  • ga O1 - compiled with options evolved by gccacovea, with -O1 as the base level of optimization.
  • ga O3 - compiled with options evolved by gccacovea, with -O3 as the base level of optimization.

The chart for each benchmark shows the gain in performance for a given set of optimizations. For example, if an unoptimized program runs in 66.1 seconds, and the same program compiled with -O3 runs in 34.5 seconds, the chart shows -O3 with a bar representing a 47.8% improvement in performance.
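In code, the figure plotted for each bar is simply the percentage reduction in run time relative to the unoptimized build:

```python
def improvement(unopt_time, opt_time):
    """Percent reduction in run time versus the unoptimized compile."""
    return (1.0 - opt_time / unopt_time) * 100.0

# The example from the text: 66.1 s unoptimized, 34.5 s with -O3.
print(round(improvement(66.1, 34.5), 1))  # 47.8
```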

I did not test the -ffast-math optimizations for this article. Given the influence of -ffast-math on accuracy, I felt it best to avoid testing it until I complete programming on an accuracy benchmark. I am not satisfied with several older floating-point accuracy tests -- but that is a subject for a future article.

All timings were performed using the gavg program (included in the distribution), which averages several runs to produce a more accurate execution time.

Individual benchmark comments are as follows:

almabench

I've used almabench in a previous article about numerical performance and Linux. This program calculates the daily ephemeris (at noon) for the years 2000-2099, and is floating-point intensive.

As the chart shows, almabench does not lend itself to optimization; the program's performance is bound to the library functions sin and cos. This was the only benchmark to benefit significantly from -ffast-math, likely due to inline code generation for sin, cos, and other mathematical functions.

This is also the test where Acovea performed most poorly; it could not identify any improvements over default optimization sets. This is not surprising, considering that the difference between -O1 and -O3 is negligible. Evolution with the -ffast-math options enabled resulted in no better performance than using -O3 -ffast-math.

Evolving from -O1, gccacovea determined that the following flags are most important to optimization:

-finline-functions
-fpeel-loops
-mieee-fp

The last item in the list above surprised me a bit, since -mieee-fp enforces additional restrictions on floating-point math that, according to conventional wisdom, should make code run slower.

fftbench

The only C++ benchmark in the group, fftbench multiplies two enormous polynomials using the Fast Fourier Transform. This benchmark's performance is predicated on the compiler's ability to effectively compile the complex<> template.
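For readers unfamiliar with the technique, the operation fftbench times can be sketched as follows: transform the coefficient lists of both polynomials, multiply them pointwise, and transform back. This Python sketch uses a textbook recursive radix-2 FFT; it illustrates the idea, and is not the benchmark's C++ source.

```python
import cmath

def fft(a, invert=False):
    # Recursive radix-2 FFT; len(a) must be a power of two.
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def poly_mul(p, q):
    # Multiply polynomials given as integer coefficient lists,
    # lowest-degree term first.
    n = 1
    while n < len(p) + len(q) - 1:
        n *= 2
    fp = fft(p + [0] * (n - len(p)))
    fq = fft(q + [0] * (n - len(q)))
    prod = fft([x * y for x, y in zip(fp, fq)], invert=True)
    return [round((c / n).real) for c in prod[:len(p) + len(q) - 1]]

# (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2
print(poly_mul([1, 2], [3, 4]))  # [3, 10, 8]
```

The benchmark's performance hinges on how well the compiler handles the complex arithmetic inside the butterfly loop, which in the C++ version is expressed through the complex<> template.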

Optimization provides between a 21% and 30% improvement in performance. -O2 is a 7.5% improvement over -O1, but -O3 provides little or no additional benefit.

Evolving from -O3, gccacovea identified -fomit-frame-pointer as the most important flag for improving performance over -O3. In an earlier analysis, Acovea identified -momit-leaf-frame-pointer as the important optimization; this flag, introduced with GCC 3.4, tells the compiler not to keep frame pointers in registers for leaf functions. However, that earlier analysis did not take into account the fact that -fomit-frame-pointer is not enabled by -O1 for Pentium compiles; these two options are mutually exclusive. Acovea 3.3 accounts for this, and -fomit-frame-pointer appears to be the more efficacious choice. Omitting frame pointers accounts for a 2% improvement in generated code speed.

Evolving from -O1, gccacovea determined that the following flags are most important to optimization:

-fforce-mem
-fschedule-insns
-fstrict-aliasing
-finline-functions
-fomit-frame-pointer
-fmove-all-movables
-funit-at-a-time

The last two optimizations are not included in -O3, being recently introduced to GCC. The -fomit-frame-pointer switch, enabled by -O1 on some architectures, is not enabled by default for Pentium compiles, because its inclusion makes debugging difficult. However, when compiling for speed, -fomit-frame-pointer can improve execution speed.

huffbench

This program implements the Huffman compression algorithm. The code is not the tightest or fastest possible C; rather, it is a relatively straightforward implementation, providing opportunities for an optimizer to "do its thing."
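For illustration, the core of the Huffman algorithm -- repeatedly merging the two rarest subtrees -- can be sketched in a few lines. This is not the huffbench source (which is C), and it computes only the total compressed size, not the codes themselves.

```python
import heapq

def huffman_cost(freqs):
    """Total compressed size, in bits, for the given symbol frequencies.

    Each merge of the two rarest subtrees adds one bit to the code of
    every symbol inside them, so summing the merged weights yields the
    total weighted code length.
    """
    heap = list(freqs)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

# A classic textbook example: six symbols with these frequencies
# compress to 224 bits in total.
print(huffman_cost([5, 9, 12, 13, 16, 45]))  # 224
```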

As such, Huffbench is one of two benchmarks (the other being treebench) where optimization is of immense value. Acovea identified the following flags as most important to optimization:

-foptimize-sibling-calls
-fno-crossjumping
-fcse-skip-blocks
-fgcse
-fstrength-reduce
-frerun-cse-after-loop
-frerun-loop-opt
-fpeephole2
-fstrict-aliasing
-freorder-blocks
-finline-functions
-frename-registers

-fmove-all-movables
-freduce-all-givs
-ftracer

In the initial release of Acovea, I was unaware of new switches implied in GCC 3.4 by -O1. One of these switches, -fcrossjumping, is, in fact, a pessimization for some programs, including Huffbench. This is one of a half-dozen bugs that Acovea identified in optimizations for GCC 3.4; all have been reported to the GCC Bugzilla database. Most of the bugs involved various problems with the -fnew-ra option, which enables an experimental register allocator in the compiler.

When added to -O1, the 15 options above produce code that is more than 12% faster than that generated by using -O3 alone. The last three of those options (set off in the list above) are recent additions to GCC, and they are not implied by -O3. My initial assumption was that adding those three options to -O3 would produce even faster code.

As with most assumptions, I was wrong.

Using -O3 -fmove-all-movables -freduce-all-givs -ftracer produced no improvement in performance over using -O3 alone. And compiling with just the first group of options above produces a program that is slower than -O2 and barely faster than -O1. This suggests a conflict between options implied by -O3 and the three newer options.

Evolving from -O3 produced entirely different -- but understandable -- results. Acovea rejected the three newer options discussed above, which makes sense given that they pessimize code compiled with the options implied by -O3. Acovea determined that adding the following options (to -O3) produced a 67% improvement in program speed:

-funroll-all-loops
-ftracer
-fnew-ra
-momit-leaf-frame-pointer

Huffbench begs for further investigation to find the reasons for this interesting behavior.

lpbench

Lpbench is a modernized version of the classic Linpack benchmark. My implementation is based on Bonnie Toy's translation of the original Fortran source code, including modifications by Jack Dongarra and Roy Longbottom.
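The heart of any Linpack derivative is solving a dense system of linear equations. A minimal sketch of that operation (an illustration only, not the lpbench source) looks like this:

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    A is a list of row lists, b a list of floats; both are modified
    in place.
    """
    n = len(A)
    for col in range(n):
        # Pivot: swap in the row with the largest entry in this column,
        # which keeps the elimination numerically stable.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        # Eliminate this column from the rows below.
        for row in range(col + 1, n):
            f = A[row][col] / A[col][col]
            for k in range(col, n):
                A[row][k] -= f * A[col][k]
            b[row] -= f * b[col]
    # Back-substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        s = b[row] - sum(A[row][k] * x[k] for k in range(row + 1, n))
        x[row] = s / A[row][row]
    return x

# 2x + y = 3, x + 3y = 5  ->  x = 0.8, y = 1.4
print(solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]))
```

The benchmark applies this same kind of factorization to a much larger dense matrix, which is why its inner loops are dominated by floating-point multiply-accumulate work.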

This benchmark gave Acovea a mild case of indigestion. When compiling lpbench, the -fnew-ra option produces internal compiler errors in combination with other options. This is a known bug in mainline GCC (at the time of this writing).

Any optimization beyond -O1 was relatively insignificant, improving run times by at most a few percent. I should note that on the Pentium 3, -O3 -freduce-all-givs generates code that is 35% faster than -O3 alone -- yet on the Pentium 4, -freduce-all-givs does not appear to be of any benefit at all. I've found other cases where Acovea discovers different "best" optimizations on the Pentium 3 and Pentium 4; again, an item to add to my "to do" list. ;)

According to Acovea (and backed by testing), the option that accounts for the improvement of -O3 over -O1 is -fstrength-reduce; no other -O2 or -O3 flag appears to be a significant factor. Acovea suggests that adding -funroll-all-loops produces marginally faster code in comparison to -O3 alone.

treebench

This benchmark constructs and modifies a B-tree data structure. The original code came from my liblaurus database engine (which is still in development). Treebench manipulates arrays, allocates and frees memory, and includes recursion.

Like huffbench, treebench lends itself well to optimization; using -O3 is a definite advantage over -O1 and -O2. Acovea found the following flags to be most important in producing fast code:

-fgcse
-fpeephole2
-fschedule-insns2
-fregmove
-fstrict-aliasing
-finline-functions
-ftracer

Using the above set of options (with -O1) produced code that was faster than -O3 alone. Using -O3 with just -ftracer improved performance even more.

Conclusions

The results I've presented describe the effects of different optimizations on a set of relatively straightforward programs that perform well-defined tasks. My conclusions so far are these:

The settings for -O1, -O2, and -O3 are valid. Not every option implied by -O3 is applicable to every program -- but enabling all those options does not seem to be detrimental, except in the case of huffbench. Several of the recently-added flags (-ftracer in particular) may be candidates for inclusion in -O2 or -O3 in future versions of the compiler. There has also been some discussion on the GCC mailing list as to whether the -fcrossjumping option should be included in -O1.

Processor-specific options do not appear to be a major factor in performance on these benchmarks. I don't know if this is due to the nature of the processor, or if GCC can't take advantage of processor-specific instructions. I have double-checked my results; adding -mfpmath=sse (or any of its variants, or -msse) to a compile of almabench does not make the code run any faster. The only ia32-specific option that showed consistent value was -momit-leaf-frame-pointer.

Faster code can be produced after analysis by Acovea. The genetic algorithm was able to find sets of flags that produced faster code than that emitted by the default -O1, -O2, and -O3 options (with the exception of almabench, which is largely unoptimizable by nature). In many cases, this was due to the inclusion of "new" flags introduced with more recent versions of GCC. Acovea discovers additional options that improve performance over that provided by the default settings; in essence, the genetic algorithm determines which of the non-standard and processor-specific options are most effective for a given algorithm.

Acovea works well with the concepts of profiling and unit testing. A well-designed program will encapsulate algorithms. For most programs, performance is predicated on specific algorithms that can be identified via profiling; Acovea can then be applied to those algorithms for finding the set of optimizations that produce the fastest code. Only rarely does an entire program need to be fully optimized; instead, optimization should be applied to specific, critical routines that perform concise tasks.

As always, I look forward to considered comments.

-- Scott


Copyright 2004
Scott Robert Ladd
All rights reserved.