Acovea: Analysis of Compiler Options via Evolutionary Algorithm
Results with Version 4.0.0 for Opteron and Pentium 4
by Scott Robert Ladd
30 May 2004
Related Articles
Acovea Overview
Analysis: Acovea 4.0.0 on Pentium 4 and Opteron
Acovea FAQ
A Description of the Genetic Algorithm
Abstract
ACOVEA (Analysis of Compiler Options via Evolutionary Algorithm) implements a genetic algorithm to find the "best" options for compiling programs with the GNU Compiler Collection (GCC) C and C++ compilers. "Best", in this context, is defined as those options that produce the fastest executable program from a given source code. Acovea is a C++ framework that can be extended to test other programming languages and non-GCC compilers.
This article describes results obtained by running Acovea on Pentium 4 and Opteron workstations. When applied to several example benchmark programs, Acovea identified optimization sets that reduced individual benchmark run times by as much as 40%, compared against code generated using the compiler's predefined -On optimization sets. The 40% figure is atypical, however; in most instances, Acovea found improved optimization sets that reduced run times by 20% or less.
In two test cases, Acovea could not find any improvement over the standard -O options — although Acovea did produce useful information by proving, in one of those cases, that the best optimization was an unadorned -O1 (and not -O2 or -O3, as might be expected).
Table 1: Benchmark Systems
Characteristic | Corwin | Tycho
Motherboard | Tyan K8W 2885 | Intel D850EMV
Processors | dual AMD Opteron model 240, 1.4GHz | single Pentium 4 2.8GHz Northwood, HyperThreading enabled
RAM | 2GB PC3200 (1GB/CPU, NUMA) | 512MB PC800 RDRAM
Hard Drive | 120GB ATA/133 | 80GB ATA/100
Linux distro | Gentoo-amd64 | Debian "sid"
Linux kernel | 2.6.3 NUMA/SMP | 2.6.3 SMP/HT-aware
glibc version | 2.3.2 | 2.3.2
binutils | 2.14.90.0.8 | 2.14.90.0.7
Acovea Changes and Test Systems
The original article appeared in late 2003, and covered my experiments with Acovea on a Pentium 4 workstation. At that time, Acovea defined compiler options and commands with C++ classes. That design grew both unwieldy and difficult to extend — people often asked if I could change Acovea to use XML configuration files. And that is exactly what I've done in version 4.0.0, upon which I've based this article.
In mid-March 2004, I purchased the dual Opteron system (Corwin) described in Table 1, which complements my year-old Pentium 4 (Tycho) machine. I chose the Opteron for many reasons, none of them particularly religious. If you're interested, check out the hows and whys of my foray into AMD processors.
This article is not a comparison of the Opteron and Pentium processors. Herein, I am writing about Acovea as it performs on the computers I own. While I have definite opinions about the Opteron and Pentium processors, those opinions belong elsewhere. Suffice it to say that both Corwin and Tycho have proven to be excellent workstations for my day-to-day work.
GCC versions
All tests were performed on a recent snapshot of GCC 3.4. GCC 3.3 is the current stable release; any discoveries I make will not lead to changes in a stable compiler, and one of my goals is to inspire improvements in the next version of GCC. My previous Acovea tests resulted in some minor improvements to GCC 3.4, and it is there that I focus my efforts for this article.
However, in recognition of reader interest, Acovea 4.0.0 provides predefined compiler definitions for both GCC 3.4 and 3.3. If time permits, I'll run 3.3 tests myself. Once tree-ssa merges into mainline GCC development, I'll be testing it (and its new Fortran 95 compiler) as well.
Benchmark tests
The current benchmark suite consists of seven algorithm-specific programs, all written to the 1999 ISO C Standard. I plan to add a couple new benchmarks to that set, and create additional implementations in Fortran 95, C++, and other languages. Such grand plans are, of course, predicated on my finding some of that mythical "spare time" I keep hearing about...
The individual tests are:
- alma: Calculates the daily ephemeris (at noon) for the years 2000-2099; tests array handling, floating-point math, and mathematical functions such as sin() and cos().
- evo: A simple genetic algorithm that maximizes a two-dimensional function; tests 64-bit math, loop generation, and floating-point math.
- fft: Uses a Fast Fourier Transform to multiply two very (very) large polynomials; tests the C99 _Complex type and basic floating-point math.
- huff: Compresses a large block of data using the Huffman algorithm; tests string manipulation, bit twiddling, and the use of large memory blocks.
- lin: Solves a large linear equation via LUP decomposition; tests basic floating-point math, two-dimensional array performance, and loop optimization.
- mat1: Multiplies two very large matrices using the brute-force algorithm; tests loop optimization.
- tree: Creates and modifies a large B-tree in memory; tests integer looping and dynamic memory management.
It might seem that these tests have many things in common -- but, as I'll describe below, Acovea found these benchmarks to have very different optimization characteristics.
Compiler Options
For the purpose of this article, I evolved the best set of options for the seven benchmark programs listed above; I also created a composite program that invokes all seven benchmarks to produce a single, overall result. At left is a chart showing the relative performance of each benchmark as compiled with different optimization settings.
- no opt - no optimizations
- -O1, -O2, -O3 - baseline compiles using the collective option sets predefined by GCC.
- acovea - compiled with options evolved by gccacovea, with -O1 as the base level of optimization.
Acovea evolves optimization sets from the -O1 level. -O1 must be included if any optimization is to occur; I have included switches in the GCC configuration files to turn off various optimizations implied by -O1 (e.g., -fno-merge-constants), thus allowing evolution to remove options implied at the base level. For GCC 3.4 on the Pentium 4, Acovea evolves optimization sets selected from 64 options, some of which include multiple possibilities (e.g., -mfpmath).
Recent discussions on the GCC mailing list (thread 1, thread 2, thread 3) revealed different perceptions about the implications of the -ffast-math option. For the tests shown here, I included -ffast-math in the evolutionary mix. I've performed experiments showing that -ffast-math does not reduce (and in some cases improves) accuracy when used with industry-standard benchmarks like Paranoia. The "floating-point accuracy" story is very complicated, but it deserves its own, to-be-written-when-I-have-time article.
In the following discussion, I refer to "optimisms" and "pessimisms", shorthand terms for compiler options that respectively improve or degrade the speed of the executable code the compiler produces.

Intel Pentium 4 (Northwood core)
Astute readers will note that Tycho is running the "unstable" release of Debian GNU/Linux. I ran all tests from the bash prompt, in text mode, no X server, with a minimum of daemons present.
For every test compile, Acovea uses gcc -lrt -lm -std=gnu99 -O1 -march=pentium4; the -march=pentium4 option implies -mcpu=pentium4, -msse, -msse2, and -mmmx, so I don't explicitly specify those options.
Back to the chart! For each benchmark, five bars depict relative performance compared to the fastest run time. For example, on huff, Acovea identified a set of options that produced the fastest code; the shorter bars show how much slower huff ran at the default optimization levels. From fastest to slowest, the optimizations for huff ranked: acovea, -O1, -O3, -O2, and no optimization at all.
Such a result might seem odd at first; shouldn't -O2 (and -O3, for that matter) produce a faster program than does -O1? Yes, by intent — but in reality, more optimization does not always equal faster code. Sometimes, higher optimization levels produce code that is too large for execution in the processor's on-board cache; in other cases, "optimizations" can be pessimistic, especially in interaction with other options. Acovea identifies and discards pessimisms by weeding them out through natural selection. See the Acovea overview for more details on Acovea's statistical reporting.
For the seven benchmarks on the Pentium 4, Acovea produced the following results:
-
alma
Strong optimisms: -fschedule-insns -funsafe-math-optimizations
Strong pessimisms: -ffloat-store -fno-inline -fnew-ra -mfpmath=sse
Acovea was unable to improve upon the options implied by -O3. In fact, there is little difference in performance between using any of the -On composites and Acovea's results. According to discussions on the GCC mailing list, this appears to be tied to a pessimistic interaction between GCC and the inline definitions of mathematical functions in the glibc headers.
-
evo
Strong optimisms: -fgcse
Strong pessimisms: -fbranch-target-load-optimize2 -mfpmath=sse -mfpmath=sse,387 -fomit-frame-pointer
While Acovea improved on the default optimization levels, it was only by a small amount.
-
fft
Strong optimisms: -fstrength-reduce -fstrict-aliasing -fprefetch-loop-arrays
Strong pessimisms: -mfpmath=sse
All optimization sets perform roughly equally, and none improves greatly over an unoptimized compile. Acovea was unable to find an improved optimization set for this benchmark.
-
huff
Strong optimisms: -freorder-blocks -fnew-ra
Strong pessimisms: -fno-guess-branch-probability -fno-if-conversion -funroll-loops -mfpmath=387 -momit-leaf-frame-pointer
Acovea shines on this benchmark, improving performance by over 40% as compared to the next-best solution, -O1. The likely reason for this extreme success lies in the pessimistic options implied by -O1 — perhaps -fno-guess-branch-probability or -fno-if-conversion.
-
lin
Strong optimisms: -fstrict-aliasing
Strong pessimisms: -mfpmath=sse,387
Like fft, Acovea could not identify any major improvements, performing equally with -O1, -O2, and -O3. Acovea proves that, for this benchmark, there is no significant value in optimizations beyond the -O1 level.
-
mat1
Strong optimisms: -fstrength-reduce -fweb
Strong pessimisms: -fno-guess-branch-probability -fno-loop-optimize -ffloat-store -fbranch-target-load-optimize2
On this test, both -O3 and Acovea performed worse than did -O2. I find the results here puzzling — why does -O2 perform so much better than -O1 or -O3? — and I intend to run more, longer tests with Acovea.
-
tree
Strong optimisms: -fgcse -fstrict-aliasing -finline-functions
Strong pessimisms: -fno-inline
Acovea found a small improvement over -O3.
Using the corresponding Acovea-generated option sets for each benchmark reduced overall compile time by about 10%, as compared to a compile using -O3 -ffast-math; the resulting composite Acovea program ran about 7% faster than did one compiled with -O3 -ffast-math.

AMD Opteron
After reviewing Acovea's results for the Pentium 4, you might be less than impressed; I was somewhat disappointed, even though Acovea produced faster option sets for three benchmarks while also identifying a nasty pessimism for huff.
My evolutionary algorithm was much more successful, however, when applied to the Opteron. On five of seven benchmarks, Acovea identified optimization sets that produced faster code than did the -On options.
Corwin's Opterons run the first release of Gentoo's AMD64 port of Linux; the system is operating entirely on the 64-bit level. That means 64-bit libraries and versions of GCC that generate 64-bit x86_64 (or AMD64, if you like) object code. While Corwin does have dual processors, the current benchmarks are all serial. I intend to produce some parallel benchmarks when (yes, that phrase is coming) I have the time. :)
For every test compile on Corwin, Acovea used gcc -lrt -lm -std=gnu99 -O1 -march=opteron; the -march=opteron option (which I'm told may be redundant for the AMD64 compiler) implies -msse, -msse2, -m3dnow, and -mmmx.
For the seven benchmarks on the Opteron, Acovea produced the following results:
-
alma
Strong optimisms: -fschedule-insns -mieee-fp -funsafe-math-optimizations
Strong pessimisms: -ffloat-store -mfpmath=sse
Acovea found an option set that was 10% faster than the default options, in contrast to its lack of success on the Pentium 4. I found it interesting that Acovea identified the same optimistic and pessimistic options for both processors, even though they showed significant improvements only on the Opteron.
-
evo
Strong optimisms: -freorder-blocks -funsafe-math-optimizations
Strong pessimisms: -mfpmath=387
On this test, -O3 performed worse than -O2, which in turn produced slower code than did -O1; while Acovea improved on -O2 and -O3, it too was beaten by -O1. A long Acovea run (100 option sets in each of ten populations, run for 100 generations) found no way to improve on -O1.
-
fft
Strong optimisms: -fstrength-reduce -fstrict-aliasing -fprefetch-loop-arrays
Strong pessimisms: -ffloat-store
The results here are similar to those from the Pentium 4 tests: Acovea was unable to find an improved optimization set for fft.
-
huff
Strong optimisms: -funroll-all-loops
Strong pessimisms: -fno-guess-branch-probability -fno-if-conversion -funroll-loops
As it did on the Pentium 4, Acovea shines on this benchmark, improving performance by more than 15% as compared to the next-best solution, -O3. Interestingly, the most optimistic options for the Opteron and Pentium 4 are completely different.
-
lin
Strong optimisms: -fstrength-reduce -fstrict-aliasing -fprefetch-loop-arrays
Strong pessimisms: -fno-loop-optimize -fbranch-target-load-optimize2 -mfpmath=sse
Acovea did very well on lin, finding an option set that produced code more than 20% faster than any -On option.
-
mat1
Strong optimisms: -fstrength-reduce
Strong pessimisms: none
While -O2 was better than -O3, Acovea found an option set that was even better. Natural selection also identified -fstrength-reduce as the most critical optimization that could be added to -O1, as it did on the Pentium 4.
-
tree
Strong optimisms: -finline-functions -fstrict-aliasing
Strong pessimisms: none
Acovea produced code that was a few percent faster than -O3; given that -finline-functions appears among the optimisms, this is a case where inlining functions is beneficial to program performance.
Using the corresponding Acovea-generated option sets for each benchmark reduced overall compile time by about 10%, as compared to a compile using -O3 -ffast-math; the resulting composite Acovea program ran about 10% faster than did one compiled with -O3 -ffast-math.
The results I've presented describe the effects of different optimizations on a set of relatively straightforward programs that perform well-defined tasks. My conclusions so far are these:
Acovea identifies weaknesses in GCC.
In the case of huff on the Pentium 4, Acovea shows that something is clearly wrong with the predefined optimization sets. During my other tests, natural selection uncovered a half-dozen "internal compiler error" bugs, many related to the -fnew-ra option. As a QA tool for testing GCC, Acovea has proven useful.
Some processor-specific options still do not appear to be a major factor in producing fast code.
Much to my surprise, I have yet to find any consistent evidence that options like -mfpmath=sse improve program performance. Acovea thus bears out my personal experience, though it does not explain why so many people continue to suggest that I use -mfpmath=sse to generate floating-point code. If someone can suggest a case where -mfpmath=sse is a clear win, I'd appreciate seeing it.
More optimizations do not guarantee faster code.
This should be obvious, but it isn't. Many different optimization techniques interact in complex ways that are likely to conflict; the combinations are simply too numerous to analyze by hand. Acovea can test these interactions and produce better optimization sets that avoid conflicts and mutual pessimisms.
Different algorithms are most effective with different optimizations.
Again, an obvious pearl of wisdom that is sometimes forgotten. Real (i.e., non-benchmark)
programs do many things; using a profiler, the critical loops can be identified, and their
algorithms refined as much as possible by hand. Once the algorithms are tuned, a
tool like Acovea will identify the best possible optimization settings. Applying
-O3 to an entire program is not likely to produce the fastest program; using
algorithm-specific optimizations to compile specific, critical routines is likely to
produce faster code.
Future Directions
I've considered a number of extensions to the current Acovea system, including:
- The use of makefiles and CFLAGS for compiling more complex programs.
- Support for other programming languages, including Fortran 95.
- Compiler configurations for other versions of GCC and other processors.
- Even more documentation (but then again, what program doesn't need better docs?)
As always, I look forward to considered comments.
-- Scott