STM32 arm-none-eabi toolchain upgrade

Below is somewhat of a “diary” of upgrading the “arm-none-eabi-gcc” toolchain for Axoloti/Ksoloti. At first I thought I’d just replace the respective arm-none-eabi related files and folders in the platform_* folders, see what errors come up (mainly warnings due to different syntax standards, I assumed), and try to fix them in the firmware. However, I encountered a few undocumented and obscure pitfalls, which I wrote about on the Ksoloti Discord server. Since this forum is accessible to everyone, I thought I’d share the process here. May it be of help to anyone who comes across it via a Google search!

Original Axoloti v1 toolchain: arm-none-eabi-gcc 4.9 (4.9-2015-q2-update, GNU Arm Embedded Toolchain)

Upgrade target: arm-none-eabi-gcc 9.3.1 (9-2020-q2-update, downloads at Arm Developer)

I have been experimenting with stepwise upgrades of the arm-none-eabi toolchain, and I see increased SRAM usage between 4.9 (the original and currently used version) and newer GCC versions, using the same Makefile.patch.

In fact, many demo patches fail to build because the SRAM limit is far exceeded.

I did try a few other -O levels, and -O3 seems to be the only usable one; the others give really poor speed, i.e. DSP performance. The Gills Polysaw demo patch, for example, runs at around 91% DSP load with 5-voice polyphony at -O3, while at the other -O settings the DSP load exceeds 100% with only 2-3 voices.

Anyway, I was wondering what I could look at to find out what makes the SRAM usage increase.

It seems like everything is a bit inflated, but the RAM cost of the dsp() threads in particular increases significantly: the dsp() thread of a subpatch called obj_1 explodes from 0x21c4 bytes to 0x4294 bytes, and the main (root:) dsp() thread grows from 0x2494 to 0x3460 bytes.

In decimal, that is about 12 kbytes more RAM (or, more precisely, more .text data copied into RAM at runtime) for the same patch, same make process, same firmware, just a different arm-none-eabi version.
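
For anyone who wants to reproduce this kind of comparison: the per-symbol sizes can be pulled out of the linked patch ELF with nm. A minimal sketch, assuming the linked ELF is still sitting in the build directory (the xpatch.elf name is an assumption, adjust it to whatever your build actually produces):

arm-none-eabi-nm --size-sort --radix=d xpatch.elf | tail -n 20

This prints the 20 largest symbols with their sizes in decimal, which makes it easy to spot which functions grew between the two toolchains.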

I feel like there is some new feature introduced in GCC 5 or later that optimizes differently.

After switching the toolchain to gcc 7.3.1, everything compiles fine, but I had to change some optimization settings (-O2 etc.) to get the SRAM usage down to what gcc 4.9 was producing. However, that results in 10-20 percent higher DSP load on the patches I tested.

I dumped the flag settings of all optimization levels using arm-none-eabi-gcc -Q -O<level> --help=optimizers on both gcc versions to get an idea of which flags each optimization level sets or unsets.
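
For reference, a minimal sketch of that comparison (the install paths are placeholders for wherever the two toolchains live):

/path/to/gcc-4.9/bin/arm-none-eabi-gcc -Q -O3 --help=optimizers > o3-gcc49.txt
/path/to/gcc-9.3/bin/arm-none-eabi-gcc -Q -O3 --help=optimizers > o3-gcc93.txt
diff o3-gcc49.txt o3-gcc93.txt

The diff shows which flags flipped between [enabled] and [disabled] at the same -O level, plus any flags that exist in only one of the two versions.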

But about 5-10% higher DSP load is as close as I can get so far. I have found the following flags to have a positive effect on RAM usage and/or DSP load:

-freorder-blocks-algorithm=simple
-fschedule-fusion
-fno-schedule-insns (not sure)
-fno-partial-inlining (not sure)

But what makes my head hurt is that with -O3 and all relevant flags disabled, it still exceeds SRAM limits, whereas with -O2 it is within SRAM limits but DSP load goes up.

It turns out we really need -O3 (i.e. the flags it enables); performance, especially in poly patches, is abysmal without it. So among the features that -O3 switches on in more recent GCC versions, there must be one that doesn’t play well with our SRAM usage.

After some more trial and error and even browsing through the gcc source code I believe I figured out which optimization flags steal our RAM.

Something to do with scheduling and vectorization… Setting the following flags reduces SRAM usage to standard Axoloti levels:

-fno-forward-propagate
-fno-partial-inlining
-fno-reorder-blocks
-fno-schedule-insns
-fno-schedule-insns2
-fvect-cost-model=cheap
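
As a concrete sketch, this is roughly what a test compile looks like with those flags added on top of -O3. The source file name and the Cortex-M4 target flags below are illustrative only; in the real build the flags would be appended to the optimization settings in the patch/firmware Makefiles:

arm-none-eabi-g++ -mcpu=cortex-m4 -mthumb -mfloat-abi=hard -mfpu=fpv4-sp-d16 -O3 \
  -fno-forward-propagate -fno-partial-inlining -fno-reorder-blocks \
  -fno-schedule-insns -fno-schedule-insns2 -fvect-cost-model=cheap \
  -c xpatch.cpp -o xpatch.o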

Compilation may take a few seconds longer with complex patches, but the time is spent optimizing memory, so it’s fine I suppose.

  • Jumping right to GCC 9.2.1, then 9.3.1.

  • Tried GCC 10+; the firmware compiled but crashed on boot, and I can’t be bothered to keep looking for fixes right now.

  • 9.3.1 is fine so far.

It turns out it was not about vectorization or scheduling, but loop unrolling/peeling.

Via a tiny remark in a thread from 15 years ago, I finally found the exact setting that makes SRAM usage increase on newer GCC versions compared to the 4.9 used by the original Axoloti.

https://linux-il.linux.org.il.narkive.com/8SnKn2yZ/disabling-loop-unrolling-in-gcc#post5

Thank you, people who were involved in that thread back then.

--param max-completely-peeled-insns=100

The setting is part of the loop unrolling/peeling optimization step. Very cryptic, and there seems to be no further documentation except some comments in the GCC source.
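
If you want to see what built-in value a given toolchain ships with, the parameter is defined in the GCC sources: up to the GCC 9.x series the defaults live in gcc/params.def (later releases moved them to gcc/params.opt). A rough way to look it up, with placeholder paths to the unpacked source trees:

grep -n -A 4 "max-completely-peeled-insns" gcc-4.9.4/gcc/params.def
grep -n -A 4 "max-completely-peeled-insns" gcc-9.3.1/gcc/params.def

The entry a few lines around the match includes the default value for that version.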

In short, what changed is that newer GCC versions consider a loop worth completely unrolling if the total number of instructions after unrolling is 200 or less; in older GCC versions this limit was 100.

What this means for Axoloti is that some loops, specifically in some filtering DSP and other fast functions, were unrolled (i.e. written out as a series of duplicates of the instructions inside the loop). This increased their memory usage by up to 725%, with little gain in speed.
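
A rough way to see the effect on a single translation unit is to compile one DSP-heavy source with and without the cap and compare the section sizes. dsp_filter.c is just a stand-in name here and the Cortex-M4 flags are illustrative; -fopt-info-loop should additionally report which loops GCC decided to unroll completely:

arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -O3 -fopt-info-loop -c dsp_filter.c -o peeled.o
arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -O3 --param max-completely-peeled-insns=100 -fopt-info-loop -c dsp_filter.c -o capped.o
arm-none-eabi-size peeled.o capped.o

The text column of arm-none-eabi-size shows directly how much code the extra unrolling adds without the cap.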

On the plus side, all the trial and error and sniffing around the GCC source code may have brought a few other small optimizations into the game, so now we may have a little less RAM usage and even a percent or two less DSP load in some cases, compared to the original Axoloti settings.