I’ve been running a small script that invokes LLMs to come up with ideas for speeding up code, implement those ideas, check them against the integration tests used by the darktable build, benchmark them if they pass, and commit the working updates. My first target is diffuse or sharpen (DoS, CPU path), a module that’s notoriously slow in this mode; I measure performance with a 24 MPx image, to which I applied all 21 presets.
The current state is below (all times in seconds; speed = original time / current time, so cutting the time in half means 2x speed; efficiency = CPU time before / CPU time after; time saved = (original time - new time) / original time):
| preset | user time before | CPU time before | user time after | CPU time after | speed | efficiency | time saved |
|---|---:|---:|---:|---:|---:|---:|---:|
| artistic effects / bloom | 1.545 | 13.327 | 0.890 | 5.776 | 1.74 | 2.31 | 42.39% |
| artistic effects / simulate line drawing | 73.138 | 846.114 | 41.218 | 473.380 | 1.77 | 1.79 | 43.64% |
| artistic effects / simulate watercolor | 5.911 | 63.404 | 3.777 | 38.903 | 1.56 | 1.63 | 36.10% |
| dehaze / extra contrast | 18.950 | 216.294 | 10.071 | 112.443 | 1.88 | 1.92 | 46.85% |
| dehaze / default | 19.092 | 216.523 | 10.080 | 112.349 | 1.89 | 1.93 | 47.20% |
| denoise / coarse | 34.244 | 395.792 | 13.389 | 154.237 | 2.56 | 2.57 | 60.90% |
| denoise / fine | 22.859 | 262.857 | 8.070 | 92.757 | 2.83 | 2.83 | 64.70% |
| denoise / medium | 28.491 | 329.083 | 10.715 | 123.259 | 2.66 | 2.67 | 62.39% |
| inpaint highlights | 12.343 | 137.508 | 9.518 | 106.167 | 1.30 | 1.30 | 22.89% |
| lens deblur / hard | 25.696 | 295.337 | 13.590 | 156.323 | 1.89 | 1.89 | 47.11% |
| lens deblur / medium | 17.127 | 196.050 | 7.692 | 86.271 | 2.23 | 2.27 | 55.09% |
| lens deblur / soft | 7.207 | 80.798 | 4.041 | 43.402 | 1.78 | 1.86 | 43.93% |
| local contrast / fast | 2.084 | 18.901 | 1.464 | 10.578 | 1.42 | 1.79 | 29.75% |
| local contrast / fine | 9.663 | 106.837 | 5.459 | 57.555 | 1.77 | 1.86 | 43.51% |
| local contrast / normal | 19.052 | 216.310 | 11.904 | 133.268 | 1.60 | 1.62 | 37.52% |
| sharpen demosaicing / AA filter | 1.113 | 9.070 | 0.809 | 5.820 | 1.38 | 1.56 | 27.31% |
| sharpen demosaicing / no AA filter | 0.873 | 7.335 | 0.643 | 4.562 | 1.36 | 1.61 | 26.35% |
| sharpness / fast | 1.866 | 16.457 | 1.306 | 9.322 | 1.43 | 1.77 | 30.01% |
| sharpness / normal | 2.390 | 24.139 | 1.700 | 15.400 | 1.41 | 1.57 | 28.87% |
| sharpness / strong | 4.503 | 49.134 | 2.958 | 30.867 | 1.52 | 1.59 | 34.31% |
| surface blur | 2.752 | 27.085 | 1.805 | 16.020 | 1.52 | 1.69 | 34.41% |
| **total** | 310.899 | 3528.355 | 161.099 | 1788.659 | 1.93 | 1.97 | 48.18% |
There’s no PR yet; I’m wondering how long I’ll keep getting improvements (I have rather limited LLM plans, so I run into limits all the time, but I can just keep the script running, and it recovers when a new usage cycle starts).
I have been looking a bit through your commits. Nice work. From what I see, it’s mostly refactoring the code to store intermediate results of the ‘relatively more expensive’ functions like cos, sin, sqrt.
This makes sense. Makes much sense.
I am surprised, though, that all in all you gain so much; 20–50% faster is huge.
Very nice work. Is there anything we can help you with?
No need for help at this point, I just wanted to show this.
I’ve realised that since I’m basing acceptance / rejection of a change on the overall speed-up it brings, often-used presets with lower runtime (e.g. sharpen demosaicing) are probably underweighted, compared to hardly ever used presets with longer runtimes (e.g. simulate line drawing). I may alter the script, or set all presets to use the same number of iterations for the performance tuning.
@martinus: I’m no good at such low-level optimisations; I did not even look at the changes. I did add some more integration tests (there was one bug that Claude introduced and later recognised itself; the mistake was not caught by the existing integration test, so I had it generate more tests to improve coverage, and checked that they would have caught it).
If it works out, we can run it against the OpenCL version, too, and for other modules as well.
I’ve updated the stats with the latest figures; now our ‘worst’ speed-up is about 35% (sharpen demosaicing / no AA filter), the average is about 80% (so what took 1.8 s now takes about 1.0 s on average). denoise / fine is now almost 3x as fast as it used to be. I’ll keep the bots running.
This is really awesome! Just a question from an engineering point of view: how do you ensure that the output of the algorithm stays equal to the output before this refactoring?
Via the provided integration tests (and I added 3 new tests, as the one that already existed for diffuse or sharpen did not catch an error introduced early on).
An update: I’ve now switched to a different strategy: instead of optimising the total time, I specifically target the least-improved preset in each round.
The results have improved further, but at a cost: the module has more than doubled in terms of lines of code. Needless to say, this is unacceptable and unmaintainable.
Nevertheless, I’ll continue this for a while. In the end, I’ll do more analysis on what worked and what did not, and will try to find a good compromise, keeping most of the performance improvement while preventing the code from blowing up. I’m pretty sure a bug has also crept in, as inpaint highlights went from over 12 seconds to about 40 ms (a 300x speed-up), which does not seem realistic. I’ll need to extend the darktable integration test suite further, because neither the original test, nor the 3 others I added were able to catch that.
Wow. I’ve added the inpaint test, generated a reference image with the master code and the optimised code – and the test passed. I’m sure something’s wrong, maybe I’ll need to adjust params. Have to read the manual.
```
Test 0091-diffuse-inpaint-highlights
Image DSC_9034.NEF
Expected CPU vs. current CPU report :
----------------------------------
Max dE : 0.32930
Avg dE : 0.00000
Std dE : 0.00020
----------------------------------
Pixels below avg + 0 std : 100.00 %
Pixels below avg + 1 std : 100.00 %
Pixels below avg + 3 std : 100.00 %
Pixels below avg + 6 std : 100.00 %
Pixels below avg + 9 std : 100.00 %
----------------------------------
Pixels above tolerance : 0.00 %
```
Update: it’s actually ‘real’, in a way: the code has changed so inpaint highlights does not process darker areas (which don’t need inpainting), and the test image simply does not trigger it. So, if you have no reason to turn on the module, it’s blazing fast now.
I am very curious to hear whether this also holds for images that do trigger inpaint highlights. Either way, you have accomplished two things, even if it turns out to be a bug:
You have demonstrated that for images that will not trigger inpainting, the implementation can be optimised a great deal.
You have demonstrated that a better test case is needed :-). This is actually very, very valuable.
Although I think - reading from what you wrote - that there is indeed a bug.
But I have seen speed-ups on the order of 100x more than once. Sometimes an engineer comes up with a clever way to improve performance big time. I remember that in one software project I was involved in, we had to introduce the performance improvement gradually, because our customers did not trust the result after the huge performance update…
[We actually made the improvement directly, but added a Thread.Sleep after it, which we reduced over time; after a year it was removed.]
I’ve now checked with an image that is affected by highlight inpainting, and it passed the test:
```
Test 0091-diffuse-inpaint-highlights
Image aces_1920x1080_graded.00349.exr
Expected CPU vs. current CPU report :
----------------------------------
Max dE : 1.01857
Avg dE : 0.00000
Std dE : 0.00139
----------------------------------
Pixels below avg + 0 std : 100.00 %
Pixels below avg + 1 std : 100.00 %
Pixels below avg + 3 std : 100.00 %
Pixels below avg + 6 std : 100.00 %
Pixels below avg + 9 std : 100.00 %
----------------------------------
Pixels above tolerance : 0.00 %
```
Above: original image; below: with inpainting (blurring the highlights), with darktable master (the optimised version delivers the same result, as shown above).
With that particular image, there’s a speed-up of about 80% (takes a bit more than half of what it used to, although about half of that comes from other tweaks, not specific to the inpaint highlights optimisation). That means on a typical image, with only a small fraction of highlights, a speed-up of 10x is perfectly realistic:
You probably know already, but it’s worth mentioning that LLMs will not refactor code unless they are explicitly asked to. This leads to code bloat in the shape of repetition, where the models generate the same (or very similar) bits of code instead of abstracting away a common utility function and reusing that. So, to some extent the code footprint explosion may be contained by a few iterations where you ask the model to optimize for structure, best-practices, code reuse and readability.
Also, in the end “more code” is not necessarily bad; it depends on the quality of that code. More lines of explicit code may be more maintainable than a few cryptic ones. The infamous Perl one-liners are a perfect example of that.