Trying to optimize the math evaluator

OK, here’s a quick update on the progress made on G’MIC development over the last few days.
I wanted to start optimizing the mathematical expression evaluator integrated into G’MIC. As you probably know, this expression evaluator has become more complex with each new release of G’MIC, and since it is now quite widely used in the code for the stdlib commands and filters of the G’MIC-Qt plug-in, I naturally thought that if we could save a few percent off the machine time spent evaluating mathematical expressions, it could really be a good thing!

So I focused on the “design principle” that seemed most likely to be optimized in the math evaluator: the way in which successive operations (e.g., arithmetic operations +, -, *, /, … or calls to built-in functions such as cos(), sin(), …) were called during evaluation.

It is important to know that this math evaluator operates in two phases: the first phase involves parsing and compiling the mathematical expression into a sequence of bytecodes, which actually describes the set of “atomic” operations to be performed in order to evaluate the expression. Running the interpreter of this bytecode sequence is precisely the second phase of the whole evaluation process.

TBH, I didn’t see many ways to significantly speed up the compilation phase. First, because the current parsing/compilation code seems relatively well done to me (apart from the fact that it uses recursion, but I don’t feel we can gain much from “unrolling” the process into a more sequential algorithm). And second, because this first compilation phase is executed only once when evaluating an expression, so the machine time savings should come more easily from interpreting the bytecode sequence, since that is often run many times (typically when filling an image with a fill expression, such as fill “cos(x)*sin(u)”). To be precise, as many times as there are pixels in the image, so often thousands or millions of times! Saving even a tiny fraction of the evaluation time for each pixel would mean a lot of time saved in the end.

So I decided to try to optimize the interpretation of the expression bytecode once it had been compiled.
To do this, and to get some potentially interesting ideas, I even asked ChatGPT, which gave me some advice that I found interesting (and which actually corresponded to the idea I had in mind before asking): try to optimize the “dispatch” of function calls.
When evaluating the pre-compiled bytecode sequence, I was indeed using pointers to static functions (for example, the mp_add() function to add two values together), with the disadvantages this entails: stacking arguments, jumping to functions (which cannot be inlined in my case), etc.
So trying to avoid these indirections and jumps by dropping the function pointers and instead dispatching inlined functions directly in my evaluation loop seemed like a pretty good idea :heart_eyes:

To cut a long story short: in the end, no, it was a really bad idea. :sweat_smile:

But the only way to get to this conclusion was to code it and make it work.
It took me a whole day’s work, because I had to change quite a few things in the mathematical expression evaluator to make it possible. Basically, I had to replace all uses of function pointers with integer identifiers (an enum), and use these identifiers in a big switch(id) { ... } inside the evaluation loop, with all these functions inlined.

At this point, I found out that my mathematical expression evaluator is actually quite comprehensive (it defines more than 350 built-in functions), something I realized a little too late, by the way.
Because trying to inline all these functions into a single large evaluation loop turned out to be a pretty bad idea: what we gain in function dispatch time, we lose because the bytecode evaluation loop no longer fits as well into the CPU cache. And after a good day’s work, I realized that I was now down 25% in performance on my mathematical expression evaluator. Argh!

So finally, I had to go back to the old bytecode evaluation model, which used function pointers in a loop that was very small and could therefore make better use of the CPU cache, despite the additional time required to call through function pointers.

The life of a developer also involves trying things that don’t work!

It could have ended there, with this big disappointment, but since I had already dug deep into the evaluator’s code, it seemed a shame not to take advantage of it while it was still fresh in my mind.

So I tried another optimization approach: searching for common patterns in the generated bytecode in order to define functions that perform several calculations at the same time corresponding to these patterns, but which only need to be called once.

Let me explain briefly: if you have an expression of the type A + B + C, where A, B, and C can be anything (e.g., parenthesized sub-expressions themselves), then instead of performing two successive additions temp = B + C then A + temp, you can define a single function add_add that performs both additions, and call it only once, with the three arguments A, B, and C.

And if you do this for all cases of two successive operations involving the most common operators (+, -, *, /), you end up with 4^2 = 16 possible functions that compute double operations.

And it turns out that these patterns occur very frequently. Think about all the expressions containing things like a*b + c, a*b*c, a - b/c, etc. So in the end, we gain a small advantage by calling a single function instead of two successive operations.
Even better, if in these double operations we are lucky enough to detect two constants, as in the expression 3*cos(u)*2, it is easy to simplify it to 6*cos(u), and thus replace two simple operations with a single one (in this case, two multiplications with one).

Well, that’s what I’ve been doing these past few days on the CImg/G’MIC code, and what I wanted to talk to you about.
It’s not super sexy, it’s not super impressive to sell, but it’s still a lot of work. And it made me realize that this mathematical expression evaluator in G’MIC was already quite good :slight_smile:
The gain from these new optimizations is still pretty minimal “on average,” except when the expressions being evaluated contain those typical patterns that are now optimized.

Today I released binaries for version 3.7.0_pre, with these improvements to the mathematical expression evaluator. If you want to test it and report any bugs, I’m all ears :slight_smile:

Cheers,

David.

3 Likes

I forgot to talk about the real kind of gains achieved with this latest optimization.
As you will see, the least we can say is that the gains are not that impressive!

Comparing G’MIC / ImageMagick’s FX option (IM6)

I also wanted to take this opportunity to compare the mathematical expression evaluator with the one in ImageMagick (version 6, which is quite old, but is the one pre-installed on my machine), provided by its option -fx.
So I retrieved the example that generates the Julia fractal from the IM web page: ImageMagick | The FX Special Effects Image Operator and adapted it for G’MIC, trying to stay as close as possible to the original script.
So, here is the script I use to compare the math expression evaluators of IM and G’MIC :

# Compare ImageMagick's -fx option with G'MIC math expression evaluator.
comp_im_gmic : check "isint(${1=3},1)"

  # Imagemagick 6.
  if $1&1
    tic
    exec "convert -size 400x400 xc:gray -fx \"
           Xi = 2.4*i/w - 1.2;
           Yj = 2.4*j/h - 1.2;
           for (pixel = 0.0, (hypot(Xi,Yj)<2.0) && (pixel<1.0),
             delta = Xi^2 - Yj^2;
             Yj = 2.0*Xi*Yj + 0.2;
             Xi = delta + 0.4;
             pixel+=0.00390625
           );
           pixel == 1.0 ? 0.0 : pixel\"
         \\( -size 1x1 xc:white xc:red xc:orange xc:yellow xc:green1 xc:cyan xc:blue xc:blueviolet xc:white
            -reverse +append -filter Cubic -resize 1024x1! \\)
         -clut -rotate -90 julia-set.png"
    e "Time with ImageMagick 6:" toc
  fi

  # G'MIC version.
  if $1&2
    tic
    400,400,1,1,":
      Xi = 2.4*x/(w - 1) - 1.2;
      Yj = 2.4*y/(h - 1) - 1.2;
      for (pixel = 0, (hypot(Xi,Yj)<2.0) && (pixel<1.0),
        delta = Xi^2 - Yj^2;
        Yj = 2.0*Xi*Yj + 0.2;
        Xi = delta + 0.4;
        pixel+=0.00390625
      );
      pixel == 1.0 ? 0.0 : pixel"
    n. 0,1023
    l[] {
      repeat 9 { arg0 $>,white,red,orange,yellow,green,cyan,blue,violet,white (${"name2color "${}}:c) }
      rv a x r 1024,1,1,3,5 c 0,255
    }
    map.. . rm.
    rotate. 90
    o julia-set2.png
    e "Time with G'MIC 3.7.0:" toc
  fi

and here is the corresponding comparison result:

~$ gmic comp_im_gmic
[gmic]./ Start G'MIC interpreter (v.3.7.0).
Time with ImageMagick 6:
[gmic]./comp_im_gmic/*if#432/ Elapsed time: 16.156 s.
Time with G'MIC 3.7.0:
[gmic]./comp_im_gmic/*if#451/ Elapsed time: 0.055 s.
[gmic]./ End G'MIC interpreter.

We’re not doing so bad :wink:
It would be fair to compare with the latest IM version though (EDIT: see my next post for that).

Comparing G’MIC 3.7.0_pre / G’MIC 3.6.6

As generating a 400x400 Julia image is quite fast with G’MIC, I wrote this custom script to compare computation times across different versions of G’MIC, rendering the Julia fractal 100 times at a 4000x4000 resolution.

# Compare G'MIC math expression evaluator for different versions.
comp_gmic :
  s_duration=0
  repeat 100 {
    tic=$|
    4000,4000,1,1,":
      Xi = 2.4*x/(w - 1) - 1.2;
      Yj = 2.4*y/(h - 1) - 1.2;
      for (pixel = 0, (hypot(Xi,Yj)<2.0) && (pixel<1.0),
        delta = Xi^2 - Yj^2;
        Yj = 2.0*Xi*Yj + 0.2;
        Xi = delta + 0.4;
        pixel+=0.00390625
      );
      pixel == 1.0 ? 0.0 : pixel"
    toc=$|
    rm.
    duration:=$toc-$tic
    s_duration+=$duration
  }
  e "Avg duration for G'MIC "${-strver}": "{_$s_duration/100}"s."

And here is the comparison result:

~$ time gmic_370 comp_gmic
[gmic]./ Start G'MIC interpreter (v.3.7.0).
Avg duration for G'MIC 3.7.0: 1.30684s.
[gmic]./ End G'MIC interpreter.

real    2m11,090s
user    30m46,180s
sys     0m2,569s
~$ time gmic_366 comp_gmic
[gmic]./ Start G'MIC interpreter (v.3.6.6).
Avg duration for G'MIC 3.6.6: 1.32391s.
[gmic]./ End G'MIC interpreter.

real    2m12,924s
user    31m24,033s
sys     0m3,909s

This gives us a staggering -1.28% gain in computing time!
Was it worth all the time spent on it? Not sure, but now that it’s done, we’re not going to complain :stuck_out_tongue:

Right now, I feel that G’MIC’s mathematical expression evaluator is quite robust, and that it will be difficult to optimize it in any dramatic way.

1 Like

To be fair, I have retrieved the latest source code from IM (v.7.x), directly from their repository, compiled it and did the julia comparison again.
Kudos to them, they have improved the speed of option -fx a lot!

Comparing G’MIC / ImageMagick’s FX option (IM7)

Here are the updated comparison results:

$ gmic comp_im_gmic
[gmic]./ Start G'MIC interpreter (v.3.7.0).
Time with ImageMagick 7:
[gmic]./comp_im_gmic/*if#412/ Elapsed time: 0.508 s.
Time with G'MIC 3.7.0:
[gmic]./comp_im_gmic/*if#431/ Elapsed time: 0.058 s.
[gmic]./ End G'MIC interpreter.

G’MIC is still a bit faster, but not that much.
Trying with a 4000x4000 image (rather than 400x400), the ratio is roughly the same:

$ gmic comp_im_gmic
[gmic]./ Start G'MIC interpreter (v.3.7.0).
Time with ImageMagick 7:
[gmic]./comp_im_gmic/*if#412/ Elapsed time: 23.86 s.
Time with G'MIC 3.7.0:
[gmic]./comp_im_gmic/*if#431/ Elapsed time: 2.226 s.
[gmic]./ End G'MIC interpreter.

In any case, it’s a huge improvement between versions 6 and 7 of ImageMagick! And all this reinforces my belief that G’MIC’s mathematical expression evaluator is really not bad at all!

1 Like

Gave it a spin, just in case you want some more numbers:

$ gmic gmic_im_com.gmic comp_im_gmic
[gmic]./ Start G'MIC interpreter (v.3.7.0).
[gmic]./ Input custom command file 'gmic_im_com.gmic' (1 new, total: 4683).WARNING: The convert command is deprecated in IMv7, use "magick" instead of "convert" or "magick convert"

Time with ImageMagick 7:
[gmic]./comp_im_gmic/*if/ Elapsed time: 10.441 s.
Time with G'MIC 3.7.0 NEW:
[gmic]./comp_im_gmic/*if/ Elapsed time: 1.482 s.
[gmic]./ End G'MIC interpreter.



$ ~/gmic_backup/gmic gmic_im_com.gmic comp_im_gmic
[gmic]./ Start G'MIC interpreter (v.3.7.0).
[gmic]./ Input custom command file 'gmic_im_com.gmic' (1 new, total: 4684).WARNING: The convert command is deprecated in IMv7, use "magick" instead of "convert" or "magick convert"

Time with ImageMagick 7:
[gmic]./comp_im_gmic/*if/ Elapsed time: 10.54 s.
Time with G'MIC 3.7.0 OLD:
[gmic]./comp_im_gmic/*if/ Elapsed time: 1.52 s.
[gmic]./ End G'MIC interpreter.

For some reason it’s a bit different with pr_iris. Probably my bad code :stuck_out_tongue: :


$ ./gmic version sp dog tic pr_iris , toc q
[gmic]./ Start G'MIC interpreter (v.3.7.0).
  gmic: GREYC's Magic for Image Computing: Command-Line Interface
        Version 3.7.0 (pre-release #26020213)
        (https://gmic.eu)

        Copyright (c) 2008-2026, David Tschumperlé / GREYC / CNRS.
        (https://www.greyc.fr)

[gmic]./ Input sample image 'dog' (1 image 1024x685x1x3).
[gmic]./ Initialize timer.
[gmic]./ Elapsed time: 3.93 s.
[gmic]./ Quit G'MIC interpreter.



$ gmic version sp dog tic pr_iris , toc q
[gmic]./ Start G'MIC interpreter (v.3.7.0).
  gmic: GREYC's Magic for Image Computing: Command-Line Interface
        Version 3.7.0 (pre-release #26021009)
        (https://gmic.eu)

        Copyright (c) 2008-2026, David Tschumperlé / GREYC / CNRS.
        (https://www.greyc.fr)

[gmic]./ Input sample image 'dog' (1 image 1024x685x1x3).
[gmic]./ Initialize timer.
[gmic]./ Elapsed time: 4.161 s.
[gmic]./ Quit G'MIC interpreter.

When doing timing comparisons with pretty close timings (like 1-3% apart), I think it’s important to run heavy computations that take a lot of time, so that in the end you get numbers that don’t depend too much on what’s happening on your system during the tests (e.g., file saving by other applications).

When I first tried to compare G’MIC 3.7.0 and 3.6.6, 3.6.6 would frequently give better timings (like 0.2 or 0.3 s faster), until I ran a lot of iterations that repeatedly showed 3.7.0 was slightly faster.
Here I have a 1% gain, and I was able to measure it with confidence in a script that took 4-5 minutes to complete.

Sure, here’s a hundred Julias:

OLD 3.7:
$ time ./gmic comp_gmic
[gmic]./ Start G'MIC interpreter (v.3.7.0).
[gmic]./ Input custom command file '/home/ck/gmic_scripts/pixlus/comp_gmic.gmic' (1 new, total: 4684).
Avg duration for G'MIC 3.7.0: 1.07517s.
[gmic]./ End G'MIC interpreter.

real    1m48.163s
user    32m11.932s
sys     0m3.017s


NEW 3.7:
$ time gmic comp_gmic
[gmic]./ Start G'MIC interpreter (v.3.7.0).
[gmic]./ Input custom command file '/home/ck/gmic_scripts/pixlus/comp_gmic.gmic' (1 new, total: 4683).
Avg duration for G'MIC 3.7.0: 1.07359s.
[gmic]./ End G'MIC interpreter.

real    1m47.977s
user    31m49.227s
sys     0m2.950s

But I’m starting to think that I’m not using the right new version. I built it this morning from the pre-release sources. Does it contain the optimizations, or are they only in the binaries?

Yes, the pre-release source is updated at the same time as the binaries.
With your experiment though, the new 3.7 seems slightly faster, which looks good to me :wink:

Yes, it is faster. Forgot to turn on Performance mode though, so here it is:

OLD:
$ time comp_gmic
[gmic]./ Start G'MIC interpreter (v.3.7.0).
Avg duration for G'MIC 3.7.0: 0.95216s.
[gmic]./ End G'MIC interpreter.

real    1m35.836s
user    27m50.722s
sys     0m2.739s

NEW:
$ time gmic comp_gmic
[gmic]./ Start G'MIC interpreter (v.3.7.0).
Avg duration for G'MIC 3.7.0: 0.93477s.
[gmic]./ End G'MIC interpreter.

real    1m34.125s
user    27m16.424s
sys     0m2.751s

I think the speedup will probably be more visible on slower CPUs.

It will probably be more visible too if you compare with G’MIC 3.6.6?
I’m not sure what your ‘OLD’ version of G’MIC 3.7.0 is, but as it took me several days to add the new optimizations, some of them may already have been in your ‘OLD’ version?

The old version is 3.7.0 (pre-release #26020213), so it’s like 8 days old :smiley:
So it’s also a bit faster than it was 8 days ago?
Anyway, thanks to you, we’ll all drink less coffee now :wink:

Tested it here, yeah, I see some speedup:

C:\Windows\System32>gmic 1024,1024 tic rep_pfrac 50 toc
[gmic]./ Start G'MIC interpreter (v.3.7.0).
[gmic]./ Input black image at position 0 (1 image 1024x1024x1x1).
[gmic]./ Initialize timer.
[gmic]./ Elapsed time: 0.911 s.
[gmic]./ Display image [0] = '[unnamed]'.
[0] = '[unnamed]':
  size = (1024,1024,1,1) [4096 Kio of float32].
  data = (0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, ... ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0).
  min = 0, max = 50104, mean = 32.642, std = 104.884, coords_min = (0,0,0,0), coords_max = (511,511,0,0).
[gmic]./ End G'MIC interpreter.

C:\Windows\System32>gmic 1024,1024 tic rep_pfrac 50 toc
[gmic]./ Start G'MIC interpreter (v.3.7.0).
[gmic]./ Input black image at position 0 (1 image 1024x1024x1x1).
[gmic]./ Initialize timer.
[gmic]./ Elapsed time: 0.894 s.
[gmic]./ Display image [0] = '[unnamed]'.
[0] = '[unnamed]':
  size = (1024,1024,1,1) [4096 Kio of float32].
  data = (0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, ... ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0).
  min = 0, max = 50059, mean = 32.6419, std = 104.859, coords_min = (0,0,0,0), coords_max = (511,511,0,0).
[gmic]./ End G'MIC interpreter.

But I wouldn’t expect much, given this command is iterative repeats. Curiously, I see a different max result per run, but I don’t expect much difference, as it uses unsafe multi-threading, and that’s fine by me for this one.

1 Like

Thanks for this, David.

In ImageMagick, until about v7.1.0-21, “-fx” used to be (aside from some simple tokenising) entirely interpreted. So at every pixel, the text fx expression was re-parsed. Error checking was minimal, because that is quite expensive and would need to be redundantly repeated at every pixel.

In December 2021 to February 2022, I re-wrote the fx module with the goals of (a) improving performance, (b) improving error checking while (c) retaining compatibility. I wrote about this in some detail at New FX. Current fx code is at ImageMagick/MagickCore/fx.c at main · ImageMagick/ImageMagick · GitHub

The re-written fx operates in two phases: translate to reverse Polish with proper error checking, and interpret that translation. The translation occurs once per channel. The translation uses recursion; I doubt that removing this would significantly improve performance of translation, and for large images translation is only a tiny part of the overall time.

Interpretation occurs once per pixel per channel. Interpretation is essentially a giant C-language switch statement with about 200 cases. I considered using a function pointer instead of switch, so about 200 functions. I didn’t compare the performance of these two methods, and can’t remember why I decided to use the switch method. Maybe because the in-house IM style uses big switches, and no function pointers.

Interpretation uses multiple threading, where available. It doesn’t use GPU. I suspect that GPU could significantly improve performance.

Run-time could be optimised. For example, reverse Polish works by pushing and popping values to and from a stack. An optimiser could remove pairs of push-followed-by-pop. An optimiser could also reduce constant expressions.

1 Like

Super interesting @snibgo , and well done.
The difference in speed of -fx between IM6 and IM7 is phenomenal :+1:

If it has the same effect as it does for me, then using a very small loop with function pointers saves 20-25% of execution time compared to a large switch (unfortunately, I tested the change in the opposite direction :confused: ).
For my part, I have 350+ functions, and if I understand correctly, a large switch statement with 350 cases does not necessarily pay off compared to a loop like the one below (the one I use in my code):

        for (p_code = p_begin; p_code<p_end; ++p_code) {
          opcode._data = p_code->_data;
          const ulongT func = opcode[0], target = opcode[1];
          mem[target] = (*(mp_func)func)(*this);
        }

where

      typedef double (*mp_func)(_cimg_math_parser&);

is the basic type for each math operation.

If I understand correctly, the fact that this loop is small enough to fit entirely within the CPU cache means that the time lost calling through function pointers is largely recovered elsewhere.

I’m not necessarily suggesting you try this technique, as there’s no guarantee that it will actually work faster, but I’d still be curious to see how it would turn out :wink:

Also, I would like to say you have my utmost respect. Writing such a mathematical expression evaluator is such a huge job!