From the experiments described below, I think it is highly probable that there are two significant problems with gcc 3.0 on x86: its new fetch scheduling (all loads issued before all computation) is a disaster on the Athlon, and its x87 register stack code is noticeably worse (many more fxch instructions) than that produced by the 2.9x compilers.
On 8/01/01, I submitted this problem as a bug to both RedHat and gnu:
On 12/03/01, a user suggested I try submitting under a different category at gnu, since I had gotten no response. It has been resubmitted as:
So, I scoped the assembler written out by each compiler, and both 3.0 and 2.96-80 have changed their fetch scheduling algorithm. The inner loop here is register blocked, and it looks something like this (variables starting with r are registers, starting with p are pointers):
   rA0 = *pA0; rB0 = *pB0;
   rA1 = pA0[1];
   rA2 = pA0[2];
   rA3 = pA0[3];
   rA4 = pA0[4];
   rC0_0 += rA0 * rB0;
   rC1_0 += rA1 * rB0;
   rC2_0 += rA2 * rB0;
   rC3_0 += rA3 * rB0;
   rC4_0 += rA4 * rB0;
Now, 2.95.2 takes this code and intermixes the fetches with the fpu calls, but 2.96-80 and 3.0 leave it in place. More than that, 2.96-80 and 3.0 will transform the scheduling back to this all-fetches-first form if you throw the -fschedule-insns flag, even if you intermix the fetches and computations by hand.
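For concreteness, here is a self-contained C sketch of such a register-blocked inner loop in the all-fetches-first order shown above. This is purely illustrative: the function name, arguments, and surrounding loop are assumptions for the sketch, not the actual ATLAS kernel.

   /* Illustrative sketch only -- not the ATLAS kernel.  A 5x1 register
    * block: five values of A and one of B are fetched, then all of the
    * multiply/adds are issued (the ordering 2.96-80 and 3.0 preserve). */
   static void mm_5x1_sketch(int K, const double *pA0, int lda,
                             const double *pB0, double *pC)
   {
      register double rA0, rA1, rA2, rA3, rA4, rB0;
      register double rC0_0=pC[0], rC1_0=pC[1], rC2_0=pC[2],
                      rC3_0=pC[3], rC4_0=pC[4];
      int k;
      for (k=0; k < K; k++)
      {
         rA0 = *pA0; rB0 = *pB0;   /* all fetches issued first ...        */
         rA1 = pA0[1];
         rA2 = pA0[2];
         rA3 = pA0[3];
         rA4 = pA0[4];
         rC0_0 += rA0 * rB0;       /* ... then all the multiply/adds      */
         rC1_0 += rA1 * rB0;
         rC2_0 += rA2 * rB0;
         rC3_0 += rA3 * rB0;
         rC4_0 += rA4 * rB0;
         pA0 += lda;               /* next column of A, next element of B */
         pB0++;
      }
      pC[0]=rC0_0; pC[1]=rC1_0; pC[2]=rC2_0; pC[3]=rC3_0; pC[4]=rC4_0;
   }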
My speculation: I'll need to get access to a Pentium III with all of these compilers in order to confirm this, but I'm guessing that scheduling all the fetches followed by all the computation allows the hardware reordering to shine on the Pentium and not on the Athlon, and that the gcc folks tested the new scheduler on the Pentium only (yes, by the way, -mcpu=athlon doesn't change the scheduling). Whatever the case for other architectures, this new scheduling algorithm is clearly a disaster for the Athlon, and needs to be fixed. NOTE: Experiments on the PIII seem to bear this out.
ATLAS, fortunately, can manually do the required fetching. The inner loop then looks something like:
   rA0 = *pA0; rB0 = *pB0;
   rC0_0 += rA0 * rB0;
   rA1 = pA0[1];
   rC1_0 += rA1 * rB0;
   rA2 = pA0[2];
   rC2_0 += rA2 * rB0;
   rA3 = pA0[3];
   rC3_0 += rA3 * rB0;
   rA4 = pA0[4];
   rC4_0 += rA4 * rB0;
With the kernel programmed in this style, we now get this performance:
Note that if we throw the -fschedule-insns flag, even the performance of this hand-scheduled code drops back to 720Mflops, as the compiler thinks it's helping you by transforming it back to the all-fetch, all-computation model.
Now, diffing the assembler produced by 2.95.2 and 2.96-80 shows 2.96 throwing in 24 as opposed to 2 fxch instructions: i.e., it is being messy with the register stack. However, it turns out the register stack optimizations have just moved to a higher optimization level, at least for 2.96-80. Pumping up the optimization level for 2.96-80 and 3.0 provides this performance:
Now, if we scope the assembler produced by 2.95.2 (either manually or auto-scheduled) and that produced by 2.96-80 on the manually-scheduled code, they are essentially identical (diff --ignore-all-space comes in handy here). However, 3.0's output is different, and slower. The 2.95.2 and 2.96-80 routines have two advantages: they have far fewer fxch instructions (2 versus 21), and they operate on the top of the stack more. Combining the Pentium III and Athlon results, fxch seems to be the thing that most negatively affects performance.
To understand better what is different in these various assembler outputs, scope out these instruction statistics for both architectures.
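If you want to gather counts like these yourself, a trivial tally program along the following lines will do. It is only an illustrative sketch (the matching is a crude substring search, so fld also counts fldl, fadd also counts faddp, and so on), not the tool used to produce the statistics referenced here; it can be pointed at the .s files produced by "make assall".

   /* tally.c -- illustrative sketch: count a few x87 mnemonics in a .s file.
    * Build: gcc -O -o tally tally.c        Usage: ./tally file.s            */
   #include <stdio.h>
   #include <string.h>

   int main(int argc, char **argv)
   {
      static const char *ops[] = {"fxch", "fld", "fst", "fmul", "fadd"};
      int counts[5] = {0, 0, 0, 0, 0};
      char line[256];
      FILE *fp;
      int i;

      if (argc != 2)
      {
         fprintf(stderr, "usage: %s <file.s>\n", argv[0]);
         return 1;
      }
      if ( !(fp = fopen(argv[1], "r")) )
      {
         perror(argv[1]);
         return 1;
      }
      while (fgets(line, sizeof(line), fp))  /* gcc -S emits one inst per line */
      {
         for (i=0; i < 5; i++)
            if (strstr(line, ops[i]))        /* crude substring match          */
               counts[i]++;
      }
      fclose(fp);
      for (i=0; i < 5; i++)
         printf("%-5s : %d\n", ops[i], counts[i]);
      return 0;
   }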
After the fetch pattern is fixed, 3.0 still does not produce as good x87 register stack code as 2.96-80 or 2.95.
MMBENCH3/Makefile has macros for 3 different compilers, and their flags. You'll need to change:
   GCC = (your 2.95 or older compiler here)
   ECC = (your 2.96-80 compiler here)
   CC3 = (your 3.0 compiler here)
If you don't care about a particular compiler, just point it to another (e.g., if you don't have 2.96-80, set ECC=$(GCC)), and ignore those results.
Now, typing "make" will build and run the timers for the code where gcc should do the scheduling (the 1350/720/720 result). To see the manually scheduled timings (1350/1350/1234), do: "make mmrout=gemm_AtlasSched"
If you want to produce the other results, you can use these two commands while changing the compiler flags. To produce assembler files for a given flag setting, type rm -f *.s ; make assall.
   Compiler        Flags                                        Mflops
   -------------------------------------------------------------------
   egcs 2.91.66    -fomit-frame-pointer -O                         387
   gcc 3.0         -fomit-frame-pointer -O                         284
   gcc 3.0         -fomit-frame-pointer -O -fschedule-insns        320
Now, just as with the Athlon, ATLAS can adapt itself to the new compiler. The above code actually attempts to utilize 11 registers, when there are only 8 available in the ISA. The new compiler must have different spill algorithms, because performance gets quite a bit better once ATLAS adapts the kernel to use 8 registers (see the sketch further below). Here is a table comparing the performance using a kernel adapted to gcc 3.0:
   Compiler        Flags                                        Mflops
   -------------------------------------------------------------------
   egcs 2.91.66    -fomit-frame-pointer -O                         366
   gcc 3.0         -fomit-frame-pointer -O                         337
   gcc 3.0         -fomit-frame-pointer -O -fschedule-insns        320
OK, with this code, the spill is no longer a problem, and 3.0 is able to improve to 337Mflops (or 87% of the best 2.9x code). So, why is 3.0 still over 10% slower than 2.9x? The assembler outputs are difficult to compare directly, but a look at some statistics is very helpful in answering this question. It looks to me like the greater number of fxch instructions produced while working the fp stack is the main culprit.
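To give an idea of what adapting to use 8 registers means, here is an illustrative sketch with a reduced register block; the 3x1 blocking is assumed purely for the example, and is not necessarily the blocking ATLAS actually selects:

   /* Illustrative sketch only -- a 3x1 register block: 3 values of A, 1 of B,
    * and 3 accumulators are 7 live values, so the kernel fits in the 8 x87
    * registers without spilling. */
   static void mm_3x1_sketch(int K, const double *pA0, int lda,
                             const double *pB0, double *pC)
   {
      register double rA0, rA1, rA2, rB0;
      register double rC0_0=pC[0], rC1_0=pC[1], rC2_0=pC[2];
      int k;
      for (k=0; k < K; k++)
      {
         rA0 = *pA0; rB0 = *pB0;
         rC0_0 += rA0 * rB0;
         rA1 = pA0[1];
         rC1_0 += rA1 * rB0;
         rA2 = pA0[2];
         rC2_0 += rA2 * rB0;
         pA0 += lda;
         pB0++;
      }
      pC[0]=rC0_0; pC[1]=rC1_0; pC[2]=rC2_0;
   }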
MMBENCH4/Makefile has macros for 3 different compilers, and their flags. You'll need to change:
   GCC = (your 2.95 or older compiler here)
   ECC = (your 3.0 compiler here)
   CC3 = (your 3.0 compiler here (used with -fschedule-insns))
Now, typing "make" will build and run the timers for the 2.9x adapted code (the 387/284/320 results). Typing "make mmrout=gemm_AtlasAdapt" runs the 3.0 adapted kernel (366/337/320). To produce assembler files for a given flag setting, type rm -f *.s ; make assall.
                          Pentium III                      Athlon
                   2.9Kern        3.0Kern        Scheduled Kern  Unscheduled Kern
   compiler:       2.9x   3.0     2.9x   3.0     2.9x   3.0      2.9x   3.0
   # fp inst:      160    160     160    160     200    200      200    200
   % stack top:    20.0   37.5    25.0   31.9    50.0   30.5     9.5    7.0
I included the percentage of fp instructions operating on the top two elements of the register stack because it seemed reasonable that pumping this percentage up might be why gcc 3.0 issues so many fxch instructions. What we see is that as the fxch count rises, so does the % stack top for the Pentium III, but for the Athlon the fxch count rises while the % stack top drops, indicating yet again how badly the Athlon is treated by 3.0.
It's clear that this table does not capture all the differences of interest (for instance, I really doubt that one extra fxch causes the 6% drop between the 2.9 and 3.0 kernels compiled with gcc 2.9x on the Pentium III), but the one thing you can say is that the kernel with the lowest number of fxch instructions always wins. If it is true that 3.0 is issuing extra fxch instructions in order to operate more often on the top of the stack, it looks like that is not a good trade-off.