Acovea: Analysis of Compiler Options via Evolutionary Algorithm

Results with Version 4.0.0 for Opteron and Pentium 4


by Scott Robert Ladd
30 May 2004
 


Abstract

ACOVEA (Analysis of Compiler Options via Evolutionary Algorithm) implements a genetic algorithm to find the "best" options for compiling programs with the GNU Compiler Collection (GCC) C and C++ compilers. "Best", in this context, is defined as those options that produce the fastest executable program from a given source code. Acovea is a C++ framework that can be extended to test other programming languages and non-GCC compilers.

This article describes results obtained by running Acovea on Pentium 4 and Opteron workstations. When applied to several example benchmark programs, Acovea identified optimization sets that reduced individual benchmark run times by as much as 40%, when compared against code generated using the compiler's predefined -On optimization sets. The 40% figure is a bit atypical, however; in most instances, Acovea found "improved" optimization sets that reduced run times by 20% or less.

In two test cases, Acovea could not find any improvement over the standard -O options — although Acovea did produce useful information by proving, in one of those cases, that the best optimization was an unadorned -O1 (and not -O2 or -O3, as might be expected).

Table 1: Benchmark Systems
Characteristic    Corwin                           Tycho
Motherboard       Tyan K8W 2885                    Intel D850EMV
Processors        dual AMD Opteron 240, 1.4GHz     single Pentium 4, 2.8GHz Northwood, HyperThreading enabled
RAM               2GB PC3200 (1GB/CPU, NUMA)       512MB PC800 RDRAM
Hard drive        120GB ATA/133                    80GB ATA/100
Linux distro      Gentoo-amd64                     Debian "sid"
Linux kernel      2.6.3 NUMA/SMP                   2.6.3 SMP/HT-aware
glibc version     2.3.2                            2.3.2
binutils          2.14.90.0.8                      2.14.90.0.7

Acovea Changes and Test Systems

The original article appeared in late 2003, and covered my experiments with Acovea on a Pentium 4 workstation. At that time, Acovea defined compiler options and commands with C++ classes. That design grew both unwieldy and difficult to extend — people often asked if I could change Acovea to use XML configuration files. And that is exactly what I've done in version 4.0.0, upon which I've based this article.

In mid-March 2004, I purchased the dual Opteron system (Corwin) described in Table 1, which complements my year-old Pentium 4 (Tycho) machine. I chose the Opteron for many reasons, none of them particularly religious. If you're interested, check out the hows and whys of my foray into AMD processors.

This article is not a comparison of the Opteron and Pentium processors. Herein, I am writing about Acovea as it performs on the computers I own. While I have definite opinions about the Opteron and Pentium processors, those opinions belong elsewhere. Suffice it to say that both Corwin and Tycho have proven to be excellent workstations for my day-to-day work.

GCC versions

All tests were performed with a recent snapshot of GCC 3.4. GCC 3.3 is the current stable release; any discoveries I make will not result in improvements to a stable compiler, and one of my goals is to inspire improvements in the next versions of GCC. My previous Acovea tests resulted in some minor improvements to GCC 3.4, and it is there that I focus my efforts for this article.

However, in recognition of reader interest, Acovea 4.0.0 provides predefined compiler definitions for both GCC 3.4 and 3.3. If time permits, I'll run 3.3 tests myself. Once tree-ssa merges into mainline GCC development, I'll be testing it (and its new Fortran 95 compiler) as well.

Benchmark tests

The current benchmark suite consists of seven algorithm-specific programs, all written to the 1999 ISO C Standard. I plan to add a couple new benchmarks to that set, and create additional implementations in Fortran 95, C++, and other languages. Such grand plans are, of course, predicated on my finding some of that mythical "spare time" I keep hearing about...

The individual tests are:

  • alma
    Calculates the daily ephemeris (at noon) for the years 2000-2099; tests array handling, floating-point math, and mathematical functions such as sin() and cos().
  • evo
    A simple genetic algorithm that maximizes a two-dimensional function; tests 64-bit math, loop generation, and floating-point math.
  • fft
    Uses a Fast Fourier Transform to multiply two very (very) large polynomials; tests the C99 _Complex type and basic floating-point math.
  • huff
    Compresses a large block of data using the Huffman algorithm; tests string manipulation, bit twiddling, and the use of large memory blocks.
  • lin
    Solves a large linear equation via LUP decomposition; tests basic floating-point math, two-dimensional array performance, and loop optimization.
  • mat1
    Multiplies two very large matrices using the brute-force algorithm; tests loop optimization.
  • tree
    Creates and modifies a large B-tree in memory; tests integer looping, and dynamic memory management.

It might seem that these tests have much in common; as I'll describe below, however, Acovea found these benchmarks to have very different optimization characteristics.

Compiler Options

For the purpose of this article, I evolved the best set of options for the seven benchmark programs listed above; I also created a composite program that invokes all seven benchmarks to produce a single, overall result. At left is a chart showing the relative performance of each benchmark as compiled with different optimization settings.

  • no opt - no optimizations
  • O1, O2, O3 - baseline compiles with those collective options defined by GCC.
  • acovea - compiled with options evolved by gccacovea, with -O1 as the base level of optimization.

Acovea evolves optimization sets from the -O1 level; -O1 must be included if any optimization is to occur. I have included switches in the GCC configuration files to turn off various optimizations implied by -O1 (e.g., -fno-merge-constants), thus allowing evolution to remove options implied at the base level. For GCC 3.4 on the Pentium 4, Acovea evolves optimization sets selected from 64 options, some of which include multiple possibilities (e.g., -mfpmath).
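To make the mechanism concrete, here is a minimal sketch of how an evolved option set becomes a gcc command line. The flag subset, the bit-per-flag chromosome encoding, and the command template are my own illustrative assumptions, not Acovea's actual internals:

```python
import random

# A hypothetical subset of the GCC 3.4 flags Acovea mutates; the real
# configuration files define some 64 options.
FLAGS = [
    "-fgcse",
    "-fstrength-reduce",
    "-fstrict-aliasing",
    "-funroll-loops",
    "-fomit-frame-pointer",
    "-ffast-math",
]

def random_chromosome(rng):
    """One candidate solution: a boolean gene per flag, on or off."""
    return [rng.random() < 0.5 for _ in FLAGS]

def command_line(chromosome, source="benchmark.c"):
    """Render a candidate as a gcc invocation; -O1 is always the base level."""
    enabled = [flag for flag, bit in zip(FLAGS, chromosome) if bit]
    return " ".join(["gcc", "-lrt", "-lm", "-std=gnu99", "-O1",
                     "-march=pentium4"] + enabled + [source])

rng = random.Random(42)
cmd = command_line(random_chromosome(rng))
print(cmd)
```

In the real framework, each rendered command is compiled and timed, and run time serves as the fitness value driving selection.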

Recent discussions on the GCC mailing list (thread 1, thread 2, thread 3) revealed different perceptions about the implications of the -ffast-math option. For the tests shown here, I included -ffast-math in the evolutionary mix. I've performed experiments showing that -ffast-math does not reduce accuracy (and in some cases improves it) when used with industry-standard benchmarks like Paranoia. The "floating-point accuracy" story is very complicated, but it deserves its own, to-be-written-when-I-have-time article.

In the following discussion, I refer to "optimisms" and "pessimisms", which are shorthand terms for compiler options that improve or degrade the speed of executable code output from the compiler.

Intel Pentium 4 (Northwood core)

Astute readers will note that Tycho is running the "unstable" release of Debian GNU/Linux. I ran all tests from the bash prompt, in text mode, no X server, with a minimum of daemons present.

For every test compile, Acovea uses gcc -lrt -lm -std=gnu99 -O1 -march=pentium4; the -march=pentium4 option implies -mcpu=pentium4, -msse, -msse2, and -mmmx, so I don't explicitly specify those options.

Back to the chart! For each benchmark, five bars depict relative performance as compared to the fastest run time. For example, on huff, Acovea identified a set of options that produced the fastest code; the shorter bars show how much slower huff ran with the default optimization levels. From fastest to slowest, the option sets for huff ranked: acovea, -O1, -O3, -O2, and no optimization at all.

Such a result might seem odd at first; shouldn't -O2 (and -O3, for that matter) produce a faster program than does -O1? Yes, by intent — but in reality, more optimization does not always equal faster code. Sometimes, higher optimization levels produce code that is too large for execution in the processor's on-board cache; in other cases, "optimizations" can be pessimistic, especially in interaction with other options. Acovea identifies and discards pessimisms by weeding them out through natural selection. See the Acovea overview for more details on Acovea's statistical reporting.
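As a rough sketch of how a flag can be classified as an optimism or a pessimism, one might compare the mean run time of candidates that carry the flag against those that don't. The thresholds and the run-time figures below are invented for illustration; this is a simplified stand-in for Acovea's actual statistical reporting:

```python
def classify_flag(results, flag):
    """Given (flag_set, run_time) pairs, compare mean run time with and
    without the flag.  Markedly faster with the flag => optimism;
    markedly slower => pessimism."""
    with_flag = [t for flags, t in results if flag in flags]
    without   = [t for flags, t in results if flag not in flags]
    if not with_flag or not without:
        return "unknown"
    mean_with = sum(with_flag) / len(with_flag)
    mean_without = sum(without) / len(without)
    if mean_with < 0.95 * mean_without:
        return "optimism"
    if mean_with > 1.05 * mean_without:
        return "pessimism"
    return "neutral"

# Hypothetical run times (seconds) for four candidate option sets.
results = [
    ({"-fgcse", "-funroll-loops"}, 13.5),
    ({"-fgcse"}, 11.8),
    ({"-funroll-loops"}, 16.5),
    (set(), 14.9),
]
print(classify_flag(results, "-fgcse"))           # → optimism
print(classify_flag(results, "-funroll-loops"))   # → pessimism
```

Natural selection achieves the same effect implicitly: candidates carrying pessimistic flags run slower, earn lower fitness, and gradually disappear from the population.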

For the seven benchmarks on the Pentium 4, Acovea produced the following results:

  • alma
    Strong optimisms: -fschedule-insns -funsafe-math-optimizations
Strong pessimisms: -ffloat-store -fno-inline -fnew-ra -mfpmath=sse

    Acovea was unable to improve upon the options implied by -O3. In fact, there is little difference in performance between using any of the -On composites and Acovea's results. According to discussions on the GCC mailing list, this appears to be tied to a pessimistic interaction between GCC and the inline definitions of mathematical functions in the glibc headers.
     
  • evo
    Strong optimisms: -fgcse
    Strong pessimisms: -fbranch-target-load-optimize2 -mfpmath=sse -mfpmath=sse,387 -fomit-frame-pointer

    While Acovea improved on the default optimization levels, it was only by a small amount.
     
  • fft
    Strong optimisms: -fstrength-reduce -fstrict-aliasing -fprefetch-loop-arrays
    Strong pessimisms: -mfpmath=sse

All optimization sets perform roughly equally, and none improve greatly over an unoptimized compile. Acovea was unable to find an improved optimization set for this benchmark.
     
  • huff
    Strong optimisms: -freorder-blocks -fnew-ra
    Strong pessimisms: -fno-guess-branch-probability -fno-if-conversion -funroll-loops -mfpmath=387 -momit-leaf-frame-pointer

    Acovea shines on this benchmark, improving performance by over 40% as compared to the next-best solution, -O1. The likely reason for this extreme success lies in the pessimistic options implied by -O1 — perhaps -fno-guess-branch-probability or -fno-if-conversion.
     
  • lin
    Strong optimisms: -fstrict-aliasing
    Strong pessimisms: -mfpmath=sse,387

Like fft, Acovea could not identify any major improvements, performing equally with -O1, -O2, and -O3. Acovea proves that, for this benchmark, there is no significant value in optimizations beyond the -O1 level.
     
  • mat1
    Strong optimisms: -fstrength-reduce -fweb
    Strong pessimisms: -fno-guess-branch-probability -fno-loop-optimize -ffloat-store -fbranch-target-load-optimize2

    On this test, both -O3 and Acovea performed worse than did -O2. I find the results here puzzling — why does -O2 perform so much better than -O1 or -O3? — and I intend to run more, longer tests with Acovea.
     
  • tree
    Strong optimisms: -fgcse -fstrict-aliasing -finline-functions
    Strong pessimisms: -fno-inline

    Acovea found a small improvement over -O3.

Using the corresponding Acovea-generated option sets for each benchmark reduced overall compile time by about 10%, as compared to a compile using -O3 -ffast-math; the resulting composite Acovea program ran about 7% faster than did one compiled with -O3 -ffast-math.

AMD Opteron

After reviewing Acovea's results for the Pentium 4, you might be less than impressed; I was somewhat disappointed, even though Acovea produced faster option sets for three benchmarks while also identifying a nasty pessimism for huff.

My evolutionary algorithm was much more successful, however, when applied to the Opteron. On five of seven benchmarks, Acovea identified optimization sets that produced faster code than did the -On options.

Corwin's Opterons run the first release of Gentoo's AMD64 port of Linux; the system operates entirely at the 64-bit level. That means 64-bit libraries and versions of GCC that generate 64-bit x86_64 (or AMD64, if you like) object code. While Corwin does have dual processors, the current benchmarks are all serial. I intend to produce some parallel benchmarks when (yes, that phrase is coming) I have the time. :)

For every test compile on Corwin, Acovea used gcc -lrt -lm -std=gnu99 -O1 -march=opteron; the -march=opteron option (which I'm told may be redundant for the AMD64 compiler) implies -msse, -msse2, -m3dnow, and -mmmx.

For the seven benchmarks on the Opteron, Acovea produced the following results:

  • alma
    Strong optimisms: -fschedule-insns -mieee-fp -funsafe-math-optimizations
    Strong pessimisms: -ffloat-store -mfpmath=sse

Acovea found an option set that was 10% faster than the default options, in contrast to its lack of success on the Pentium 4. I found it interesting that Acovea identified the same optimistic and pessimistic options for both processors, even though these only showed significant improvements on the Opteron.
     
  • evo
    Strong optimisms: -freorder-blocks -funsafe-math-optimizations
    Strong pessimisms: -mfpmath=387

On this test, -O3 performed worse than -O2, which in turn produced slower code than did -O1; while Acovea improved on -O2 and -O3, it was beaten by -O1 as well. A long Acovea run (100 option sets in each of ten populations, run for 100 generations) found no way to improve on -O1.
     
  • fft
    Strong optimisms: -fstrength-reduce -fstrict-aliasing -fprefetch-loop-arrays
    Strong pessimisms: -ffloat-store

    The results here are similar to those from the Pentium 4 tests: Acovea was unable to find an improved optimization set for fft.
     
  • huff
    Strong optimisms: -funroll-all-loops
    Strong pessimisms: -fno-guess-branch-probability -fno-if-conversion -funroll-loops

    As it did on the Pentium 4, Acovea shines on this benchmark, improving performance by more than 15% as compared to the next-best solution, -O3. Interestingly, the most optimistic options for the Opteron and Pentium 4 are completely different.
     
  • lin
    Strong optimisms: -fstrength-reduce -fstrict-aliasing -fprefetch-loop-arrays
Strong pessimisms: -fno-loop-optimize -fbranch-target-load-optimize2 -mfpmath=sse

Acovea did very well on lin, finding an option set that produces code more than 20% faster than any -On option.
     
  • mat1
    Strong optimisms: -fstrength-reduce
    Strong pessimisms: none

    While -O2 was better than -O3, Acovea found an option set that was even better. Natural selection also identified -fstrength-reduce as the most critical optimization that could be added to -O1, as it did on the Pentium 4.
     
  • tree
    Strong optimisms: -finline-functions -fstrict-aliasing
    Strong pessimisms: none

Acovea produced code that was a few percent faster than -O3; this is a case where inlining functions benefits program performance.
     

Using the corresponding Acovea-generated option sets for each benchmark reduced overall compile time by about 10%, as compared to a compile using -O3 -ffast-math; the resulting composite Acovea program ran about 10% faster than did one compiled with -O3 -ffast-math.

Conclusions

The results I've presented describe the effects of different optimizations on a set of relatively straightforward programs that perform well-defined tasks. My conclusions so far have been these:

Acovea identifies weaknesses in GCC.
In the case of huff on the Pentium 4, Acovea shows that something is clearly wrong with the predefined optimization sets. During my other tests, natural selection uncovered a half-dozen "internal compiler error" bugs, many related to the -fnew-ra option. As a QA tool for testing GCC, Acovea has proven useful.

Some processor-specific options still do not appear to be a major factor in producing fast code.
Much to my surprise, I have yet to find any consistent evidence that options like -mfpmath=sse improve program performance. Acovea thus bears out my personal experience, though it does not explain why so many people continue to suggest that I use -mfpmath=sse to generate floating-point code. If someone can point to a case where -mfpmath=sse clearly helps, I'd appreciate seeing it.

More optimizations do not guarantee faster code.
This should be obvious, but it isn't. The interactions among many different optimization techniques are complex enough that conflicts are likely, and too numerous to reason about by hand. Acovea can test these interactions, and produce better optimization sets that avoid conflicts and mutual pessimisms.

Different algorithms are most effective with different optimizations.
Again, an obvious pearl of wisdom that is sometimes forgotten. Real (i.e., non-benchmark) programs do many things; using a profiler, the critical loops can be identified, and their algorithms refined as much as possible by hand. Once the algorithms are tuned, a tool like Acovea will identify the best possible optimization settings. Applying -O3 to an entire program is not likely to produce the fastest program; using algorithm-specific optimizations to compile specific, critical routines is likely to produce faster code.
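The per-routine approach above can be sketched as a small build helper that assigns each critical source file its own flag set and gives everything else a default. The file names and flag assignments here are hypothetical, chosen only to illustrate the idea:

```python
# Hypothetical per-file option sets: critical routines get their own
# Acovea-evolved flags, everything else gets a plain default.
per_file_flags = {
    "huffman.c": ["-O1", "-freorder-blocks", "-fnew-ra"],
    "matrix.c":  ["-O1", "-fstrength-reduce", "-fweb"],
}
default_flags = ["-O2"]

def compile_command(source):
    """Build the gcc command for one translation unit."""
    flags = per_file_flags.get(source, default_flags)
    return " ".join(["gcc", "-std=gnu99", "-c"] + flags + [source])

for src in ["huffman.c", "matrix.c", "main.c"]:
    print(compile_command(src))
```

In a real build, the same effect is usually achieved with per-target CFLAGS in a makefile; the point is simply that flags need not be uniform across a program.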

Future Directions

I've considered a number of extensions to the current Acovea system, including:

  • The use of makefiles and CFLAGS for compiling more complex programs.
  • Support for other programming languages, including Fortran 95.
  • Compiler configurations for other versions of GCC and other processors.
  • Even more documentation (but then again, what program doesn't need better docs?)

As always, I look forward to considered comments.

-- Scott


Copyright 2004
Scott Robert Ladd
All rights reserved.