Callback runtime

Hello,
I have written a stepper motor driver that uses a variable hardware timer period for generating step pulses.
The hardware timer uses a CThunk to call a class method from the timer ISR. This method then calls a callback to the user function of the timer component.
These two callbacks take about 4 µs of runtime on my STM32F407 @ 168 MHz. This seems to be a lot; the calculation of the period time with several floating point operations takes less than 2 µs.
Are callbacks so expensive? The program is already compiled as a release build with optimization. Times are measured with a Saleae logic analyzer sampling at 50 MHz, measuring the step pulse.

Edit:
I have published the source code here:

It works, but I haven’t tried to improve the callback runtime yet. The CThunk/Callbacks are so convenient…

Very impressive! Especially as you run the display driver, the touch-screen driver and the stepper driver in parallel with each other.

The hardware timer uses a CThunk to call a class method from the timer ISR.

The CThunk/Callbacks are so convenient…

There is no reference to CThunk in the Callback documentation (Docs › API references and tutorials › Platform › Other platform APIs › Callback), and I was not able to find anything about CThunk elsewhere apart from the mbed-os/platform/include/platform/CThunk.h header file. However, according to Wikipedia, a thunk utilizes a dispatch table (a table of pointers or memory addresses to functions or methods). Using such a table is a common technique for implementing late binding (similar to the vtable used for virtual functions). So I think this could contribute to longer callback execution times (as with virtual functions).

This seems to be a lot; the calculation of the period time with several floating point operations takes less than 2 µs.

The STM32F407 is equipped with an FPU so the floating point calculations are very fast.


Thanks for your feedback Zoltan,
yes, this was a test to do everything at the same time :slight_smile:

I have still built it with CLI1, but I want to convert some more libraries to CLI2, and then I will publish this code as well.
LVGL is a great library for GUI stuff and not difficult to use. The driver for the ILI9341 needs only one function to copy a render buffer (which is at minimum 10 lines x display width) to the display. This board uses a parallel 16-bit interface, which is very fast when driven by the FSMC interface on the F4. I’m using DMA with low priority, so the fast timer interrupts have low jitter.

I don’t know where I discovered CThunk; it is used to bind the timer ISR to a class method. An ISR has no parameters, so there are some levels of indirection and a static table is used. The number of CThunks is also limited and adjustable via an mbed_lib.json setting.
But like the callbacks, it takes about 2 µs, which is time for many instructions on a 168 MHz CPU. So I want to investigate further why it is taking so long. It may be faster to use my own jump table. I remember that the callbacks need good compiler optimization, but I’m already using the release settings.

And yes, the FPU is amazingly fast, and there is really no need to try error-prone integer optimization. One thing to remember is to always use float constants with the suffix ‘f’; otherwise you will quickly get a penalty of several µs! The bad thing when converting libraries written for AVR controllers is that their authors didn’t care about the difference between float and double. The AVR C library treats double as float, so there is no difference on AVR, but on a tiny LPC8xx you blow your flash with one double instruction :slight_smile:
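
To illustrate the point about the ‘f’ suffix, here is a minimal sketch (the function names are made up for this example). The FPU of the Cortex-M4 handles single precision only, so an unsuffixed literal promotes the expression to double and the arithmetic is done in software. The static_asserts only check the resulting types; they say nothing about the actual timing on a given board.

```cpp
#include <type_traits>

// Hypothetical helper: the literal 0.5 is a double, so x is promoted and the
// multiplication is carried out in double precision.
inline auto scale_slow(float x) { return x * 0.5; }   // float * double -> double

// Hypothetical helper: with the 'f' suffix everything stays in float and can
// be handled by the single-precision FPU.
inline auto scale_fast(float x) { return x * 0.5f; }  // float * float -> float

static_assert(std::is_same<decltype(scale_slow(1.0f)), double>::value,
              "unsuffixed literal promotes the result to double");
static_assert(std::is_same<decltype(scale_fast(1.0f)), float>::value,
              "suffixed literal keeps the result in float");
```

GCC can point out such places when compiling with -Wdouble-promotion.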

I bet they are expensive. What annoys me most is the call stack they produce. It can become tedious to read through a call stack with mbed callbacks during debugging.

I wonder why this was developed instead of using std::function (cppreference.com). A performance/feature comparison with std::function would be interesting…
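
For readers who have not used std::function: a minimal sketch of how it wraps both a free function and a member call behind the same type. Stepper, tick and run are names invented for this example, and nothing here measures performance. Whether a particular capture fits into the small-object buffer or needs a heap allocation depends on the standard library implementation.

```cpp
#include <functional>

// Hypothetical class standing in for a driver with a member handler.
struct Stepper {
    int steps = 0;
    void step() { ++steps; }
};

static int g_ticks = 0;
static void tick() { ++g_ticks; }  // plain free function handler

// Call the type-erased handler n times. The caller cannot tell whether it
// wraps a free function, a lambda or a bound member function.
inline void run(const std::function<void()>& handler, int n) {
    for (int i = 0; i < n; ++i) {
        handler();
    }
}
```

Usage would look like `run(tick, 3);` for the free function and `run([&s] { s.step(); }, 5);` for a member of a Stepper instance `s`.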

Thanks, that sounds interesting. I haven’t used std::function before.
Callbacks were already in use a long time ago in mbed-os2, so I guess it is for historical reasons.
I will try to compare both variants.

First google hit for ‘std::function performance’ is

We use std::function for code not related to mbed.

A PR to extend the different APIs to use std::function might also be welcome, if that’s possible.

If possible, you can also try to trade convenience for speed by using static callbacks. Of course, in that case all the data members used in the callback have to be static as well (practically, such a callback is a “global” C function wrapped in the class’s namespace). It’s a somewhat awkward technique that I use in my Arduino libraries and in this mbed library.

Yes, we’ve actually also used this solution.

But it works only if you have one instance of the class or if all the instances can share the same callback. You can differentiate the caller by having an id parameter that’s different for each object.
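
A minimal sketch of the static callback with an id parameter described above. The Motor class, kMaxMotors and the TickFn type are assumptions made up for illustration; they are not taken from any of the libraries mentioned.

```cpp
#include <array>
#include <cstddef>

class Motor {
public:
    static constexpr std::size_t kMaxMotors = 4;  // hypothetical limit

    // Plain function pointer type that a (hypothetical) timer driver accepts.
    using TickFn = void (*)(std::size_t id);

    // One static callback shared by all motors; the id selects the state.
    static void on_tick(std::size_t id) {
        if (id < kMaxMotors) {
            ++position_[id];
        }
    }

    static long position(std::size_t id) { return position_[id]; }

private:
    // The state touched by the callback has to be static as well.
    static std::array<long, kMaxMotors> position_;
};

std::array<long, Motor::kMaxMotors> Motor::position_{};
```

The call itself is then an ordinary indirect call through a function pointer, at the price of keeping the per-object state in a static table.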

I found some discussion about mbed::callback and std::function here:

But that is heavy stuff…

I have created a simple test for measuring the cycles for a call:

The ‘penalty’ for using callbacks does not look that big:

callback performance test
Hello from STM32F407VE_BLACK
Mbed OS version: 6.15.0

cyles staticFn          : 16  0.095 us
cyles callback staticFn : 35  0.208 us
cyles callback memberFn : 51  0.304 us

For the ISR, there should be an additional 12+17 (for FPU) cycles.

Edit:
OK, I added CThunk to the test, and there I have my 2 µs (when I add the ISR cycles):

cyles cthunk            : 277  1.649 us 
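
The conversion behind these numbers is just cycles divided by the core clock. A small sketch, assuming the 168 MHz clock stated at the top of the thread (the helper name is made up):

```cpp
#include <cstdint>

constexpr double kCoreClockHz = 168e6;  // STM32F407 core clock from the thread

// Convert a cycle count into microseconds at the given core clock.
constexpr double cycles_to_us(std::uint32_t cycles,
                              double clock_hz = kCoreClockHz) {
    return static_cast<double>(cycles) * 1e6 / clock_hz;
}
```

With this, 277 cycles come out at about 1.649 µs, and adding the 12 + 17 ISR cycles gives 306 cycles or about 1.82 µs, which is the roughly 2 µs seen on the logic analyzer.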

Can I improve the runtime in my code by putting the code into RAM? How can I do this?

The STM32F407 is equipped with 64 kB of Core Coupled Memory (CCM) allowing 0-wait state execution. The CCM is usually used to store critical data. However, it can be used to store code instead of data. To use it we need to define this memory region inside the linker script as discussed in this post. But for storing code in CCM rather than data, the ccm region should be modified as below:

.ccm :
{
    . = ALIGN(8);
    *(.ccm .ccm*)
} > CCM

To relocate a specific function inside the CCM we can use the GCC keyword __attribute__.
For example:

void __attribute__((section(".ccm"))) function_name() 
{
...
}

Thanks for reminding me of my older thread, I had already forgotten it :slight_smile:
In my custom_target lib, I do not use a modified linker script yet; I must check how to include this in CLI2.

Another question about this: is it possible with CLI2 to define a different linker script in the application configuration? I could create multiple custom targets with different linker scripts, but I’m not sure if this is good practice.
Different linker scripts are needed when using F7/H7 with RAM on different buses; for DMA it’s necessary to have control over the RAM sections used.

For optimization of HWTimer, I will now remove the CThunk and use a fixed number of static handlers that are assigned during timer initialization.
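
A minimal sketch of what such a fixed set of static handlers could look like. This is not the actual HWTimer code: the HwTimer class here, kSlots, isr_trampoline and bind_timer are all names invented for the example, and the slot count of 2 is an assumption.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for the timer class discussed in the thread.
class HwTimer {
public:
    void on_irq() { ++irq_count; }
    int irq_count = 0;
};

constexpr std::size_t kSlots = 2;                 // fixed number of handlers (assumed)
static std::array<HwTimer*, kSlots> g_slots{};    // instance bound to each slot

// One parameterless handler per slot, generated at compile time. Each one
// forwards to the instance stored in its slot, if any.
template <std::size_t N>
void isr_trampoline() {
    if (g_slots[N] != nullptr) {
        g_slots[N]->on_irq();
    }
}

using IsrFn = void (*)();
static const IsrFn g_handlers[kSlots] = { &isr_trampoline<0>, &isr_trampoline<1> };

// Bind a timer to the first free slot during initialization. Returns the
// handler to install as interrupt vector, or nullptr if all slots are taken.
inline IsrFn bind_timer(HwTimer& t) {
    for (std::size_t i = 0; i < kSlots; ++i) {
        if (g_slots[i] == nullptr) {
            g_slots[i] = &t;
            return g_handlers[i];
        }
    }
    return nullptr;
}
```

On the target, the returned pointer would be the address passed to something like CMSIS NVIC_SetVector for the timer’s IRQ. Each interrupt then costs one table lookup and one member call.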

Is it possible with CLI2 to define a different linker script in the application configuration?

As for CLI2, so far I have managed to take only four steps in the online CMake tutorial. So I’m afraid this question should be answered by more skilled people (like Ladislas, Bora, Jamie Smith …).


which online tutorial do you mean? The link is not working.

Yes, learning CMake takes some time. Yesterday I prepared the sources for my StepperController project for CLI2 and stepped into the same trap as before. The log files and generated files are hard to read.

For some commands the order in CMakeLists matters, but in general, I like it already.

About the CThunk: it is used in the interrupt-driven async APIs for I2C, SPI and UART, so it looks like it is only missing from the documentation. But I’m still thinking about replacing it with static ISR handlers.

which online tutorial do you mean? The link is not working.

I’m sorry. I pasted a wrong link :frowning: It should work now.

By “custom target”, you mean hardware or cmake? I’ll assume hardware for my answer.

I haven’t tried it, but in theory you could. Having a linker script/custom target does make sense as your custom target might be used for different things.

That being said, you will need to recompile everything when you change your custom target as configuration is global for one hardware target.

In the future it would be nice to be able to have multiple hardware targets used for different cmake targets. Especially useful when you have a MCU with two or more cores or if you have more than one mcu on your custom board and want to build firmware for each one of them.

The way we “handle” that for now is to use different build directories for the cmake configuration steps. For example, we have one for unit tests, one for the tools, one for the main product and one for the prototypes.

With ‘custom target’ I mean it the way the term is used in Mbed. So yes, it’s hardware that is not included in the main mbed-os repo. Especially for STM32 it’s easy to derive from existing MCU definitions, thanks to Jerome.

If I understand it correctly, an mbed-mcu library is created for each MCU, e.g. mbed-stm32f103x8

This library contains the definitions for the startup code and linker script. A particular piece of hardware then has definitions for its peripheral pins and is linked to its mbed-mcu.
So I need to create a custom target with its own mbed-mcu definition to use a different .ld file? I don’t know if mbed_set_linker_script can be overridden; I haven’t checked it yet.

And one missing piece of the puzzle for me is: how does a target like NUCLEO_STM32F103RB come into play so that this link chain is chosen?

OK, I found an answer to the last question:
the magic is in mbed-tools configure, this writes the

Then this mbed_config.cmake is used in mbed-os/CMakeLists.txt

Yes, that would be it. Not sure it would work out of the box but with cmake we’ll find a way.

I don’t think you would need that.

A callback function is a function passed into another function as an argument, which is then invoked inside the outer function to complete some kind of routine or action.