diffuse or sharpen optimisation

I’ve been running a small script that invokes LLMs to come up with ideas to speed up code, implement those ideas, check them against the integration tests used by the darktable build, benchmark them if the tests pass, and commit the working updates. My first target is diffuse or sharpen (CPU path), a module that’s notoriously slow in this mode. I measure performance with a 24 MPx image, to which I applied all 21 presets.

The current state is summarised below. All times are in seconds, and the derived columns are:

- speed = original time / current time (so cutting the time in half means a 2x speed-up)
- efficiency = CPU time before / CPU time after
- time saved = (original time - new time) / original time

| preset | user time before (s) | CPU time before (s) | user time after (s) | CPU time after (s) | speed | efficiency | time saved |
|---|---:|---:|---:|---:|---:|---:|---:|
| artistic effects / bloom | 1.545 | 13.327 | 0.89 | 5.776 | 1.74 | 2.31 | 42.39% |
| artistic effects / simulate line drawing | 73.138 | 846.114 | 41.218 | 473.38 | 1.77 | 1.79 | 43.64% |
| artistic effects / simulate watercolor | 5.911 | 63.404 | 3.777 | 38.903 | 1.56 | 1.63 | 36.10% |
| dehaze / extra contrast | 18.95 | 216.294 | 10.071 | 112.443 | 1.88 | 1.92 | 46.85% |
| dehaze / default | 19.092 | 216.523 | 10.08 | 112.349 | 1.89 | 1.93 | 47.20% |
| denoise / coarse | 34.244 | 395.792 | 13.389 | 154.237 | 2.56 | 2.57 | 60.90% |
| denoise / fine | 22.859 | 262.857 | 8.07 | 92.757 | 2.83 | 2.83 | 64.70% |
| denoise / medium | 28.491 | 329.083 | 10.715 | 123.259 | 2.66 | 2.67 | 62.39% |
| inpaint highlights | 12.343 | 137.508 | 9.518 | 106.167 | 1.30 | 1.30 | 22.89% |
| lens deblur / hard | 25.696 | 295.337 | 13.59 | 156.323 | 1.89 | 1.89 | 47.11% |
| lens deblur / medium | 17.127 | 196.05 | 7.692 | 86.271 | 2.23 | 2.27 | 55.09% |
| lens deblur / soft | 7.207 | 80.798 | 4.041 | 43.402 | 1.78 | 1.86 | 43.93% |
| local contrast / fast | 2.084 | 18.901 | 1.464 | 10.578 | 1.42 | 1.79 | 29.75% |
| local contrast / fine | 9.663 | 106.837 | 5.459 | 57.555 | 1.77 | 1.86 | 43.51% |
| local contrast / normal | 19.052 | 216.31 | 11.904 | 133.268 | 1.60 | 1.62 | 37.52% |
| sharpen demosaicing / AA filter | 1.113 | 9.07 | 0.809 | 5.82 | 1.38 | 1.56 | 27.31% |
| sharpen demosaicing / no AA filter | 0.873 | 7.335 | 0.643 | 4.562 | 1.36 | 1.61 | 26.35% |
| sharpness / fast | 1.866 | 16.457 | 1.306 | 9.322 | 1.43 | 1.77 | 30.01% |
| sharpness / normal | 2.39 | 24.139 | 1.7 | 15.4 | 1.41 | 1.57 | 28.87% |
| sharpness / strong | 4.503 | 49.134 | 2.958 | 30.867 | 1.52 | 1.59 | 34.31% |
| surface blur | 2.752 | 27.085 | 1.805 | 16.02 | 1.52 | 1.69 | 34.41% |
| **total** | 310.899 | 3528.355 | 161.099 | 1788.659 | 1.93 | 1.97 | 48.18% |
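To make the definitions concrete, here is denoise / fine worked through from its row above:

      speed      = 22.859 / 8.07            ≈ 2.83
      efficiency = 262.857 / 92.757         ≈ 2.83
      time saved = (22.859 - 8.07) / 22.859 ≈ 64.70%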

There’s no PR yet; I’m wondering how long I’ll keep getting improvements. I have rather limited LLM plans, so I run into usage limits all the time, but I can just keep the script running: it recovers when a new usage cycle starts.

For those interested, the branch is here: Commits · kofa73/darktable · GitHub

Each commit contains a short description.
Once it’s been discussed with the team and the method is accepted, I’ll release the script as well.

19 Likes

I have been looking a bit through your commits. Nice work. From what I see, it’s mostly refactoring the code to store the intermediate results of relatively expensive functions like cos, sin and sqrt.
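Something like this, I guess (a hypothetical sketch of the pattern, not code taken from the actual commits):

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical sketch: 'angle' is loop-invariant, so the trig call can be
   evaluated once and stored, instead of once per pixel.
   Before, the loop body would have read:
     out[k] = in[k] * cosf(angle);   // width * height redundant cosf() calls */
static void scale_by_cos(float *out, const float *in,
                         const int width, const int height, const float angle)
{
  const float c = cosf(angle); /* computed once, reused for every pixel */
  for(int j = 0; j < height; j++)
    for(int i = 0; i < width; i++)
      out[(size_t)j * width + i] = in[(size_t)j * width + i] * c;
}
```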

This makes sense, a lot of sense.

I am surprised, though, that all in all you gain so much; 20–50% faster is huge.

Very nice work. Is there anything we can help you with?

2 Likes

Excellent stuff, István!

2 Likes

AP seems to be working on this module in his fork. He mentioned bug fixes, so it might be good to cherry-pick his commits.

8 Likes

“Awesome” is much overused these days, but this IS awesome, @kofa!
Especially the good figures for deblur.

1 Like

No need for help at this point, I just wanted to show this.

I’ve realised that since I base acceptance / rejection of a change on the overall speed-up it brings, often-used presets with short runtimes (e.g. sharpen demosaicing) are probably underweighted compared to hardly-ever-used presets with long runtimes (e.g. simulate line drawing). I may alter the script, or set all presets to use the same number of iterations for the performance tuning.
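To illustrate the skew with the user times of two presets (the numbers are from the table above; the aggregation code itself is just a toy sketch):

```c
#include <stdio.h>

/* Toy illustration of the weighting problem. An aggregate based on total
   time is dominated by the slow, rarely used preset; an unweighted mean of
   per-preset speed-ups gives each preset an equal vote. */
int main(void)
{
  const double line_before = 73.138, line_after = 41.218; /* simulate line drawing */
  const double demo_before = 0.873,  demo_after = 0.643;  /* sharpen demosaicing / no AA */

  const double total_speedup = (line_before + demo_before) / (line_after + demo_after);
  const double mean_speedup  = (line_before / line_after + demo_before / demo_after) / 2.0;

  printf("total-time speed-up     : %.2f\n", total_speedup); /* ~1.77: tracks line drawing */
  printf("mean per-preset speed-up: %.2f\n", mean_speedup);  /* ~1.57: equal weights */
  return 0;
}
```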

@martinus: I’m no good at such low-level optimisations; I did not even look at the changes. I did add some more integration tests: there was one bug that Claude made and later realised itself. The mistake was not caught by the existing integration test, so I had it generate more tests to improve coverage, and checked that they would have caught it.

If it works out, we can run it against the OpenCL version, too, and for other modules as well.

2 Likes

Great timing… :-/

2 Likes

I’ve updated the stats with the latest figures; now our ‘worst’ speed-up is about 35% (sharpen demosaicing / no AA filter), and the average is about 80% (so what took 1.8 s now takes about 1.0 s on average). denoise / fine is now almost 3x as fast as it used to be. I’ll keep the bots running. :slight_smile:

9 Likes

This is really awesome! Just a question from an engineering point of view: how do you ensure that the output of the algorithm stays equal to the output before this refactoring?

2 Likes

Darktable has tooling to diff images to check for differences in output, check this recent issue out: Recent regressions to be analyzed (more feedback needed) · Issue #20651 · darktable-org/darktable · GitHub

2 Likes

Via the provided integration tests (and I added 3 new tests, as the one that already existed for diffuse or sharpen did not catch an error introduced early on).

3 Likes

Nice, really nice.

An update: I’ve now switched to a different strategy: instead of optimising the total time, I’m specifically targeting the least-improved preset in each round.
The results have improved further, but at a cost: the module has more than doubled in lines of code. Needless to say, that is unacceptable and unmaintainable.
Nevertheless, I’ll continue this for a while. In the end, I’ll do more analysis of what worked and what did not, and will try to find a good compromise that keeps most of the performance improvement while preventing the code from blowing up. I’m pretty sure a bug has also crept in, as inpaint highlights went from over 12 seconds to about 40 ms (a 300x speed-up), which does not seem realistic. :slight_smile: I’ll need to extend the darktable integration test suite further, because neither the original test nor the 3 others I added were able to catch that.

6 Likes

Thank you for the awesome efforts AND critical thinking, Kofa.

Wow. I’ve added the inpaint test, generated a reference image with the master code, and ran the test against the optimised code – and it passed. I’m sure something’s wrong; maybe I’ll need to adjust the params. I’ll have to read the manual.

Test 0091-diffuse-inpaint-highlights
      Image DSC_9034.NEF
      Expected CPU vs. current CPU report :
      ----------------------------------
      Max dE                   : 0.32930
      Avg dE                   : 0.00000
      Std dE                   : 0.00020
      ----------------------------------
      Pixels below avg + 0 std : 100.00 %
      Pixels below avg + 1 std : 100.00 %
      Pixels below avg + 3 std : 100.00 %
      Pixels below avg + 6 std : 100.00 %
      Pixels below avg + 9 std : 100.00 %
      ----------------------------------
      Pixels above tolerance   : 0.00 %

Update: it’s actually ‘real’, in a way: the code has changed so that inpaint highlights does not process darker areas (which don’t need inpainting), and the test image simply does not trigger it. :slight_smile: So, if you have no reason to turn on the module, it’s blazing fast now. :smiley:
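The kind of guard I mean looks something like this (a hypothetical sketch, not the actual darktable code):

```c
#include <stddef.h>

/* Stand-in for the real per-pixel diffusion work (hypothetical). */
static float expensive_diffusion(const float v)
{
  return v * 0.5f;
}

/* Sketch of such an early-out: pixels below the highlight threshold are
   skipped entirely, so an image with no (or few) highlights costs almost
   nothing. */
static void inpaint_highlights(float *img, const size_t npixels, const float threshold)
{
  for(size_t k = 0; k < npixels; k++)
  {
    if(img[k] < threshold) continue;      /* dark pixel: nothing to inpaint */
    img[k] = expensive_diffusion(img[k]); /* only highlights pay the cost */
  }
}
```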

5 Likes

Hahaha :laughing:

So the test image is too dark to trigger the threshold for inpainting?

I am very curious to hear whether this also holds for images that do trigger inpaint highlights. At least you have accomplished two things, even if it turns out to be a bug:

  1. You have demonstrated that for images that do not trigger inpainting, the implementation can be optimised considerably.
  2. You have demonstrated that a better test case is needed :-). This is actually very, very valuable.

Although I think - reading from what you wrote - that there is indeed a bug.

But I have seen speed-ups on the order of 100x more than once. Sometimes an engineer comes up with a clever way to improve performance big time. I remember that in one software project I was involved in, we had to introduce the performance improvement gradually, because our customers did not trust the results after the huge performance update…

[We actually made the improvement immediately, but added a Thread.Sleep after it, which we reduced over time and removed after a year.]

I’ve now checked with an image that is affected by highlight inpainting, and it passed the test:

Test 0091-diffuse-inpaint-highlights
      Image aces_1920x1080_graded.00349.exr
      Expected CPU vs. current CPU report :
      ----------------------------------
      Max dE                   : 1.01857
      Avg dE                   : 0.00000
      Std dE                   : 0.00139
      ----------------------------------
      Pixels below avg + 0 std : 100.00 %
      Pixels below avg + 1 std : 100.00 %
      Pixels below avg + 3 std : 100.00 %
      Pixels below avg + 6 std : 100.00 %
      Pixels below avg + 9 std : 100.00 %
      ----------------------------------
      Pixels above tolerance   : 0.00 %

Above: the original image; below: with inpainting (blurring the highlights), processed with darktable master (the optimised version delivers the same result, as the test report above shows).

With that particular image, there’s a speed-up of about 80% (it takes a bit more than half the time it used to, although about half of that comes from other tweaks, not specific to the inpaint highlights optimisation). Since the expensive work is now only done for pixels above the threshold, the cost scales with the fraction of highlight pixels; so on a typical image, where only a small fraction of the pixels are highlights, a speed-up of 10x is perfectly realistic.

3 Likes

You probably know this already, but it’s worth mentioning that LLMs will not refactor code unless they are explicitly asked to. This leads to code bloat in the shape of repetition, where the models generate the same (or very similar) bits of code instead of abstracting a common utility function and reusing it. So, to some extent, the code footprint explosion may be contained by a few iterations where you ask the model to optimise for structure, best practices, code reuse and readability.
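For instance (a hypothetical illustration, not something from the branch): if the model emits the same 3-tap smoothing loop in several places, a refactoring pass should collapse the copies into a single helper:

```c
/* Instead of repeating this loop body wherever a buffer needs smoothing,
   extract it once and call it with different buffers (hypothetical). */
static void smooth3(float *dst, const float *src, const int n)
{
  for(int i = 1; i < n - 1; i++)
    dst[i] = (src[i - 1] + 2.0f * src[i] + src[i + 1]) / 4.0f;
}
```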

Also, in the end, “more code” is not necessarily bad; it depends on the quality of that code. More lines of explicit code may be more maintainable than a few cryptic ones. The infamous Perl one-liners are a perfect example of that :slight_smile:

4 Likes