darktable performance regression with kernel 6.7 and newer.

Hi, I am experiencing a performance regression with darktable with newer kernels. The LTS kernels 6.6.x are fine, but starting with the 6.7 series up to the most recent kernels I see a performance regression of almost 25 %.

I have raised this with the darktable developers, but they see it as a kernel issue rather than a darktable issue and are not willing to invest time in investigating it. The issue I created for this topic has been closed: Performance regression with atrous module and newer kernels (e.g. 6.7.12 or 6.10.6) compared to 6.6.47 · Issue #17397 · darktable-org/darktable · GitHub

In that issue it was suggested that I start a thread here in this forum. So here I am. I want to address this topic also to the kernel developers. But before I do so I am seeking some confirmation that I am not alone with this performance regression.

Here are my findings in a nutshell.

I do a raw conversion on the command line with OpenCL disabled, so everything runs on the CPU. I check the debug output for the time darktable spent in the pixel pipeline:

darktable-cli bench.SRW /tmp/test.jpg --core --disable-opencl -d perf -d opencl --configdir /tmp

The relevant output line looks like this:

4,2765 [dev_process_export] pixel pipeline processing took 3,811 secs (81,883 CPU)
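For comparing logs automatically, here is a minimal Python sketch that extracts the wall-clock and CPU seconds from such a line. The function name `parse_pipeline_line` is made up for this example, and the pattern is inferred only from the sample line above, not from darktable's source. It accepts both `,` and `.` as decimal separator, since that depends on the locale.

```python
import re

# Pattern inferred from the sample log line above (an assumption,
# not taken from darktable's source code).
_PIPELINE_RE = re.compile(
    r"pixel pipeline processing took\s+(\d+[.,]\d+) secs \((\d+[.,]\d+) CPU\)"
)


def _to_float(text):
    """Convert '3,811' or '3.811' to a float."""
    return float(text.replace(",", "."))


def parse_pipeline_line(line):
    """Return (wall_seconds, cpu_seconds), or None if the line does not match."""
    match = _PIPELINE_RE.search(line)
    if match is None:
        return None
    return _to_float(match.group(1)), _to_float(match.group(2))


sample = ("4,2765 [dev_process_export] pixel pipeline processing "
          "took 3,811 secs (81,883 CPU)")
print(parse_pipeline_line(sample))
```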

When I do this benchmark with different kernels I find that with kernel 6.7 and newer the pixel pipeline is roughly 25 % slower than with kernel 6.6 and older (I also tested with kernel 6.5).

The main contributor seems to be the module called ‘atrous’. It accounts for most of the extra time with newer kernels:

with kernel 6.6.47:

4,0548 [dev_pixelpipe] took 0,635 secs (14,597 CPU) [export] processed 'atrous' on CPU, blended on CPU
...
4,2765 [dev_process_export] pixel pipeline processing took 3,811 secs (81,883 CPU)

with kernel 6.10.6:

4,9645 [dev_pixelpipe] took 1,489 secs (33,736 CPU) [export] processed 'atrous' on CPU, blended on CPU
...
5,2151 [dev_process_export] pixel pipeline processing took 4,773 secs (102,452 CPU)

This example shows that the atrous module accounts for nearly all of the performance drop: it takes about 0.85 s longer, while the whole pipeline takes about 0.96 s longer. Overall conversion time goes from 3.8 s to 4.8 s on my PC. That is significant.
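How much of the slowdown comes from atrous can be worked out from the four timings in the two logs above. A small Python sketch of that arithmetic (the variable names are just for illustration):

```python
# Timings in seconds, copied from the two logs above.
atrous = {"6.6.47": 0.635, "6.10.6": 1.489}
pipeline = {"6.6.47": 3.811, "6.10.6": 4.773}

# Absolute slowdown of the module and of the whole pipeline.
atrous_delta = atrous["6.10.6"] - atrous["6.6.47"]
pipeline_delta = pipeline["6.10.6"] - pipeline["6.6.47"]

# Relative slowdown of the pipeline, and the share of it caused by atrous.
slowdown_pct = 100 * pipeline_delta / pipeline["6.6.47"]
atrous_share_pct = 100 * atrous_delta / pipeline_delta

print(f"pipeline slowdown: {pipeline_delta:.3f} s ({slowdown_pct:.1f} %)")
print(f"atrous slowdown:   {atrous_delta:.3f} s "
      f"({atrous_share_pct:.1f} % of the pipeline slowdown)")
```

With these numbers the pipeline is about 25 % slower, and atrous is responsible for close to 90 % of the added time.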

Does anybody else experience a similar performance drop with kernels 6.7+?

I did not try using different kernels, but recently I’ve felt a drop in performance using darktable. I tried the same pipe on Windows and it was significantly quicker, but I’m using the GPU and maybe implementations/compatibility are not the same. Also, it could just be caused by the new “color equalizer” module which I started to use recently and which is rather slow.

What version of DT are you using and are you building it yourself… Also if you used openCL was there a difference… if not and since this would be the normal way to optimally run DT does it really matter that much…

A wild guess, but wasn’t there a change in the CPU scheduler?
However, it appears the change was introduced in 6.6 (“Linux 6.6 bringt neuen Scheduler” [Linux 6.6 brings a new scheduler] | heise online, only in German though), but each version probably introduced new changes.

this code is heavily simdfied and multithreaded. are you sure both runs use the same code path? i.e. the sse vs the i386 implementations. another thing i would check is hyperthreading. do you use it? maybe there was another one of the kernel security fixes that make everything slower? can you disable hyperthreading in the bios for a test?
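One way to check the hyperthreading and security-mitigation angle without going into the BIOS is to record what each kernel reports in sysfs and compare the two boots. A minimal Python sketch, assuming a Linux system that exposes `/sys/devices/system/cpu/smt/active` and `/sys/devices/system/cpu/vulnerabilities/`; the function names are made up for this example, and missing or unreadable files are simply skipped:

```python
from pathlib import Path

CPU_SYSFS = Path("/sys/devices/system/cpu")


def _read(path):
    """Return the stripped file content, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def read_cpu_state(base=CPU_SYSFS):
    """Collect SMT status and the per-vulnerability mitigation strings."""
    base = Path(base)
    state = {}
    smt = _read(base / "smt" / "active")
    if smt is not None:
        state["smt_active"] = smt
    vuln_dir = base / "vulnerabilities"
    if vuln_dir.is_dir():
        for entry in sorted(vuln_dir.iterdir()):
            value = _read(entry) if entry.is_file() else None
            if value is not None:
                state[f"vuln:{entry.name}"] = value
    return state


def diff_states(old, new):
    """Return {key: (old_value, new_value)} for every key that differs."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


if __name__ == "__main__":
    # Run once per booted kernel and compare the printed output.
    for key, value in read_cpu_state().items():
        print(f"{key}: {value}")
```

If the output is identical under 6.6 and 6.10, a changed mitigation or SMT setting is unlikely to be the cause.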

I am doing all this on an EndeavourOS (Arch) Linux system. darktable is installed with the package manager. The version is 4.8.1.

With OpenCL enabled there is no difference: darktable performs the same with all kernels. Only when I force darktable to use only the CPU do I see the performance drop.

If it were a scheduler issue, it would affect all modules in the pixel pipeline the same way. But that is not the case.

I am really looking for help reproducing my findings. Can someone here please try it? It does not take much. You only need to be able to boot into the LTS kernel 6.6 and a newer one, most likely 6.10.x. And you need a RAW file of your choice to do the darktable-cli run. You capture darktable's debug output and post it here.
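The procedure described above could be scripted roughly as follows. This is only a sketch: it assumes `darktable-cli` is on the PATH, uses the options already shown in this thread, and the log file naming scheme and function names are made up for this example.

```python
import platform
import subprocess
import tempfile
from pathlib import Path


def build_command(raw_file, output_file, config_dir):
    """darktable-cli invocation with the CPU-only options used in this thread."""
    return [
        "darktable-cli", str(raw_file), str(output_file),
        "--core", "--disable-opencl", "-d", "perf",
        "--configdir", str(config_dir),
    ]


def log_name(kernel_release, run):
    """Hypothetical naming scheme, e.g. 'kernel_6.6.47-1-lts_run1.log'."""
    return f"kernel_{kernel_release}_run{run}.log"


def run_benchmark(raw_file, runs=5, log_dir="."):
    """Export raw_file `runs` times and keep darktable's debug output per run."""
    kernel = platform.release()  # release string of the running kernel
    written = []
    for run in range(1, runs + 1):
        # A fresh scratch directory per run avoids reusing config or output files.
        with tempfile.TemporaryDirectory() as scratch:
            cmd = build_command(raw_file, Path(scratch) / "out.jpg", scratch)
            result = subprocess.run(cmd, capture_output=True, text=True)
        log_path = Path(log_dir) / log_name(kernel, run)
        log_path.write_text(result.stdout + result.stderr)
        written.append(log_path)
    return written

# Usage, once per booted kernel:
#   run_benchmark("bench.SRW")
```

Because the kernel release ends up in the file name, the logs from both boots can be kept in the same directory and posted together.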

Raw and sidecar from the GitHub issue:

https://drive.google.com/drive/folders/1cfV2b893JuobVwGiZXcaNv5-yszH6j-N?usp=sharing

I’m running Windows…sorry, can’t help. I wonder if the difference would still be there if you used a version compiled for your hardware, i.e. your CPU… it’s not too hard to set up and compile from master or any release branch that you might prefer…

That’s true. However, why should the kernel influence just a single module anyway? I do not really see a reason either.

Last time I checked, my PC would not work with the current bookworm kernel (6.1), that’s why I’m using trixie (6.10)…

‘Unfortunately’, only 6.8 is available on my Ubuntu box.
For me, atrous (contrast equalizer) runs in 1 second on the provided file. Results from 5 runs:

    12.3992 [dev_pixelpipe] took 1.062 secs (11.710 CPU) [export] processed `atrous' on CPU, blended on CPU
    12.2982 [dev_pixelpipe] took 1.049 secs (11.645 CPU) [export] processed `atrous' on CPU, blended on CPU
    12.3278 [dev_pixelpipe] took 1.051 secs (11.622 CPU) [export] processed `atrous' on CPU, blended on CPU
    12.3308 [dev_pixelpipe] took 1.054 secs (11.599 CPU) [export] processed `atrous' on CPU, blended on CPU
    12.3620 [dev_pixelpipe] took 1.050 secs (11.676 CPU) [export] processed `atrous' on CPU, blended on CPU
    12.8033 [dev_process_export] pixel pipeline processing took 7.064 secs (68.451 CPU)
    12.7200 [dev_process_export] pixel pipeline processing took 7.029 secs (68.208 CPU)
    12.7278 [dev_process_export] pixel pipeline processing took 7.063 secs (68.205 CPU)
    12.7425 [dev_process_export] pixel pipeline processing took 7.030 secs (68.547 CPU)
    12.7602 [dev_process_export] pixel pipeline processing took 7.066 secs (68.695 CPU)

Interestingly, the pipeline took 7 seconds for me, but ~3.8 or ~4.8 seconds for you (so my CPU would perform a lot worse than yours); at the same time, atrous took 1 second for me, between your times of ~0.6 s and ~1.5 s.

for i in {1..5} ; do rm -rf /tmp/tmp out*jpg ; ~/darktable-master/bin/darktable-cli bench.SRW bench.SRW.xmp out.jpg --core --configdir /tmp/tmp --cachedir /tmp/tmp -d perf --disable-opencl ; done

Thank you for taking the time and doing that run, but it is not helpful. There is no meaning in comparing values between your PC and my PC.

It only makes sense if we have a second run on your PC with an LTS kernel. Then we can compare your numbers 6.6 vs. 6.8

Well, it does influence only this one module. That’s a fact.

I do not know what the ‘atrous’ module is doing. That would be a good question for the developers. But they refused to look into this (see my github issue).

It could, for example, be related to memory management. Maybe atrous is doing heavy memory operations (allocation, copy, move, etc.). And maybe the kernels starting with 6.7 changed some memory management functionality.

As soon as I have feedback here in the forum where someone else confirms my findings I will create a ticket for the kernel devs.

You are right, of course. I could try booting from a pendrive that has the installer image, and try the AppImage of 4.8.1 both with an older kernel and a current one. I’ll post the results if I can find the time to do the experiment.

I believe this is the contrast eq…so wavelet computations??

Okay, got 6.6.13 running versus 6.10.11 and have the following results:

kernel_6.10.log:     5,1859 [dev_pixelpipe] took 1,018 secs (22,754 CPU) [export] processed `atrous' on CPU, blended on CPU
kernel_6.10.log:     5,2105 [dev_pixelpipe] took 0,990 secs (22,723 CPU) [export] processed `atrous' on CPU, blended on CPU
kernel_6.10.log:     5,2137 [dev_pixelpipe] took 0,979 secs (22,683 CPU) [export] processed `atrous' on CPU, blended on CPU
kernel_6.10.log:     5,2013 [dev_pixelpipe] took 0,984 secs (22,741 CPU) [export] processed `atrous' on CPU, blended on CPU
kernel_6.10.log:     5,1942 [dev_pixelpipe] took 0,983 secs (22,609 CPU) [export] processed `atrous' on CPU, blended on CPU
kernel_6.6.log:     5,2350 [dev_pixelpipe] took 0,873 secs (20,387 CPU) [export] processed `atrous' on CPU, blended on CPU
kernel_6.6.log:     5,1209 [dev_pixelpipe] took 0,882 secs (20,361 CPU) [export] processed `atrous' on CPU, blended on CPU
kernel_6.6.log:     5,5227 [dev_pixelpipe] took 0,874 secs (20,344 CPU) [export] processed `atrous' on CPU, blended on CPU
kernel_6.6.log:     5,1139 [dev_pixelpipe] took 0,866 secs (20,267 CPU) [export] processed `atrous' on CPU, blended on CPU
kernel_6.6.log:     5,1436 [dev_pixelpipe] took 0,868 secs (20,179 CPU) [export] processed `atrous' on CPU, blended on CPU

Total time:

kernel_6.10.log:     5,4362 [dev_process_export] pixel pipeline processing took 4,723 secs (103,571 CPU)
kernel_6.10.log:     5,4949 [dev_process_export] pixel pipeline processing took 4,783 secs (104,258 CPU)
kernel_6.10.log:     5,4740 [dev_process_export] pixel pipeline processing took 4,755 secs (103,824 CPU)
kernel_6.10.log:     5,4525 [dev_process_export] pixel pipeline processing took 4,744 secs (104,246 CPU)
kernel_6.10.log:     5,4451 [dev_process_export] pixel pipeline processing took 4,733 secs (103,544 CPU)
kernel_6.6.log:     5,4911 [dev_process_export] pixel pipeline processing took 4,673 secs (101,989 CPU)
kernel_6.6.log:     5,3685 [dev_process_export] pixel pipeline processing took 4,646 secs (100,467 CPU)
kernel_6.6.log:     5,7639 [dev_process_export] pixel pipeline processing took 4,698 secs (101,239 CPU)
kernel_6.6.log:     5,3609 [dev_process_export] pixel pipeline processing took 4,625 secs (100,714 CPU)
kernel_6.6.log:     5,3827 [dev_process_export] pixel pipeline processing took 4,664 secs (100,630 CPU)

CPU: AMD Ryzen 9 3900

kernel_6.6.log.txt (17.6 KB)
kernel_6.10.log.txt (17.6 KB)

There appears to be a difference of around 13% on average for atrous in my case (about 2% for the whole pipeline).
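The averages can be checked from the ten atrous and ten pipeline timings listed above. A short Python sketch (the dictionary layout and the helper name `slowdown_pct` are just for illustration):

```python
from statistics import mean

# Wall times in seconds, copied from the logs above.
atrous = {
    "6.6":  [0.873, 0.882, 0.874, 0.866, 0.868],
    "6.10": [1.018, 0.990, 0.979, 0.984, 0.983],
}
pipeline = {
    "6.6":  [4.673, 4.646, 4.698, 4.625, 4.664],
    "6.10": [4.723, 4.783, 4.755, 4.744, 4.733],
}


def slowdown_pct(times):
    """Percentage by which the 6.10 mean exceeds the 6.6 mean."""
    old, new = mean(times["6.6"]), mean(times["6.10"])
    return 100 * (new - old) / old


print(f"atrous:   {slowdown_pct(atrous):.1f} %")
print(f"pipeline: {slowdown_pct(pipeline):.1f} %")
```

The mean atrous time goes from about 0.873 s to about 0.991 s, while the whole pipeline goes from about 4.66 s to about 4.75 s.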

Thank you for doing that test. Your difference is not as much as mine but still significant.
I wonder if it makes a difference that you used the darktable git version.

What kind of PC do you have? Intel or AMD?

It’s an AMD

A friend of mine reported and fixed a kernel issue that was giving him a 25-35% slowdown on AMD CPUs. His workload was not darktable, but the kernel fix landed in 6.6.51. No clue if it is similar to this or not.

EDIT: it isn’t similar, his regression was a missing backported patch. Funny coincidence tho.

This is what I was alluding to earlier…I wonder if you had a version compiled on your PC using any available CPU enhancements specific to your hardware if you would see something different over the generic one that you likely have now???