Help me understand GPU use in darktable

I’ve recently become plagued with frustratingly slow performance when I load images into darktable.

When I open a batch of photos, whether it’s just a few or over 1000, the thumbnails take forever to load in lighttable, and when I try to open a group in culling mode and zoom to 100%, the system slows to a crawl.

While this is going on, I’m monitoring CPU and GPU usage, and the CPU is pegged at 100%, while the GPU is sitting unused at 0%.

Is this normal? Should I expect the GPU to be used for these tasks? Is there something wrong in my darktable settings?

My system:
• Fedora 39 Linux
• darktable 4.6.0
• CPU: AMD Ryzen 7 3700X 8-Core
• GPU: NVIDIA GeForce GTX 1660 Super
• 32GB RAM

My GPU settings in darktablerc:

opencl=TRUE
opencl_async_pixelpipe=false
opencl_avoid_atomics=false
opencl_checksum=1725294375
opencl_device_priority=+0/+0/+0/+0/+0
opencl_disable_drivers_blacklist=false
opencl_library=
opencl_mandatory_timeout=20000
opencl_memory_headroom=400
opencl_memory_requirement=768
opencl_micro_nap=1000
opencl_number_event_handles=25
opencl_scheduling_profile=default
opencl_size_roundup=16
opencl_synch_cache=active module
opencl_tune_headroom=FALSE
opencl_tuning_mode=memory size
opencl_use_cpu_devices=false
opencl_use_pinned_memory=false

darktable-cltest shows the GPU is recognized and enabled:

[opencl_init] OpenCL successfully initialized. internal numbers and names of available devices:
[opencl_init]		0	'NVIDIA CUDA NVIDIA GeForce GTX 1660 SUPER'
     1.0135 [opencl_init] FINALLY: opencl is AVAILABLE and ENABLED.
[opencl_init] opencl_scheduling_profile: 'default'
[opencl_init] opencl_device_priority: '+0/+0/+0/+0/+0'
[opencl_init] opencl_mandatory_timeout: 20000
[dt_opencl_update_priorities] these are your device priorities:
[dt_opencl_update_priorities] 		image	preview	export	thumbs	preview2
[dt_opencl_update_priorities]		0	0	0	0	0
[dt_opencl_update_priorities] show if opencl use is mandatory for a given pixelpipe:
[dt_opencl_update_priorities] 		image	preview	export	thumbs	preview2
[dt_opencl_update_priorities]		1	1	1	1	1
[opencl_synchronization_timeout] synchronization timeout set to 200
[dt_opencl_update_priorities] these are your device priorities:
[dt_opencl_update_priorities] 		image	preview	export	thumbs	preview2
[dt_opencl_update_priorities]		0	0	0	0	0
[dt_opencl_update_priorities] show if opencl use is mandatory for a given pixelpipe:
[dt_opencl_update_priorities] 		image	preview	export	thumbs	preview2
[dt_opencl_update_priorities]		1	1	1	1	1

And yet, I don’t see anything hit the GPU at more than a brief bump of a few percent. I would expect it to hit 100% GPU usage. Or am I understanding this wrong?

A lot of those settings in darktablerc don’t look correct. Create a backup of that file by renaming it (e.g. to darktablerc.old) so you can start over with fresh opencl settings.

Post the complete output of darktable-cltest and the output of darktable -d common (both as txt files).
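
A minimal shell sketch of those steps (assuming the default config location of ~/.config/darktable; the backup filename is just an example):

    # rename the config aside as the backup; darktable writes a fresh
    # default darktablerc at its next start
    mv ~/.config/darktable/darktablerc ~/.config/darktable/darktablerc.old

    # capture the diagnostic output as text files
    darktable-cltest > darktable-cltest.txt 2>&1
    darktable -d common > darktable-dcommon.txt 2>&1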

Thanks. I’ve moved my darktablerc file and allowed darktable to create a new default file. Here are the txt files with the output from darktable-cltest and darktable -d common.

darktable.dcommon.txt (24.0 KB)
darktable-cltest.txt (4.1 KB)

The GPU looks good. Since the output shows you have 4 OpenCL platforms, keep only the Nvidia checkbox enabled.

With your fast card, I would change the scheduling profile from ‘default’ to ‘very fast GPU’. This will force all the processing to happen on the GPU instead of the CPU. That’s how I use it on my system.
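
If you would rather set this by editing darktablerc directly (with darktable closed), the line should look something like this; the exact value string is my assumption and may differ between versions:

    opencl_scheduling_profile=very fast GPU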

The filmicrgb module is taking some execution time. Do you have a lot of iterations in it? The GPU path should help with this compared to the CPU.

Thanks! I’ve switched it back to “very fast gpu”. I actually don’t use filmic rgb any more. I recently switched to using sigmoid, but that setting was presumably lost when I reset to a default darktablerc file. I’ve got the old one, so I can get that back.

This may be as good as I can get it, but darktable is still annoyingly slow when I zoom to 100% in culling mode. I use this workflow a lot in wildlife photography, where I may take 10-50 photos in a burst and use culling mode in lighttable to select the best one to process. When I zoom to 100%, all the images show “working…”, and with each group of 4 (the maximum number that still lets me zoom), I need to wait 30 seconds or more before I can see the zoomed image. When going through hundreds of files, that becomes a problem. When I monitor my system with nvtop, it seems this process is still using the CPU instead of the GPU. See the attached image: the CPU is at 584% (possible because I have 16 threads), while the GPU is still at 0%.

Is that just the way darktable is, i.e. zooming in lighttable always uses the CPU regardless?

The -d common shows that the thumbnails are processing filmicrgb.

    18.3825 pixelpipe process CPU      [thumbnail]      filmicrgb              (   0/   0) 8191x5463 scale=1.0000 --> (   0/   0) 8191x5463 scale=1.0000 IOP_CS_RGB
    26.2022 transform colorspace CPU   [thumbnail]      colorout               (   0/   0) 8191x5463 scale=1.0000 --> (   0/   0) 8191x5463 scale=1.0000 IOP_CS_RGB -> IOP_CS_LAB

It took about 8 seconds there.

Can you post a -d perf now that you have it setup to Very Fast GPU?

Here’s the output from -d perf. It looks like a lot of thumbnail processing is happening on the CPU. I never realized that so much processing happened on thumbnails. Is there a way to turn that off? Or at least tell it to use the GPU?
darktable.dperf.txt (8.5 KB)

We have a similar system (Fedora 39 KDE with Nvidia card). My thumbnail processing is all in GPU and it takes 0.2s total.

There is something going on in your system. It started processing on the GPU and then switched to the CPU. It stayed on the CPU, and some of the modules are taking a long time. The last module to use the GPU was:

    28.2809 [dev_pixelpipe] took 2.768 secs (7.482 CPU) [thumbnail] processed 'filmicrgb' on GPU with tiling, blended on CPU

With 6 GB of GPU memory, I’m not sure why it is even tiling.

Sorry to keep asking for more files, but let’s do a -d opencl run.

Thanks for your help with this! I’ve run it with -d opencl and attached the output. Something puzzles me about this: it shows “reached opencl_mandatory_timeout trying to lock mandatory device”. I had earlier tried forcing the GPU by setting opencl_device_priority=+0/+0/+0/+0/+0. That didn’t do anything, so I set it back to the default of opencl_device_priority=*/!0,*/*/*/!0,*. But it still seems to consider the GPU mandatory and waits for a timeout before switching to the CPU.
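
For reference, my reading of that setting’s syntax (the comments are my interpretation, so treat them as an assumption): the five slash-separated fields correspond to the image/preview/export/thumbs/preview2 pipes listed by darktable-cltest.

    # '+0' marks device 0 as mandatory for every pipe, so each pipe
    # waits for the GPU rather than falling back to the CPU
    opencl_device_priority=+0/+0/+0/+0/+0

    # default: any device for most pipes, but '!0,*' excludes device 0
    # for the preview and preview2 pipes
    opencl_device_priority=*/!0,*/*/*/!0,*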

Also, where I’m seeing the greatest delay is in culling mode when zooming four images to 100%, which I guess are actually previews, not thumbnails. But regardless, thumbnails are still showing CPU use.
darktable.dopencl2.txt (6.1 KB)

Let’s bump the mandatory timeout setting in darktablerc. Let’s try 1000 (this equals 5 seconds). This is not the root cause of the issue, though.
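
That is, with darktable closed, edit the line to something like:

    # per the conversion above, 1000 of these units is roughly 5 seconds
    opencl_mandatory_timeout=1000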

Also, let’s go back to -d common. The -d opencl run is not showing much info.

You could also try deleting your OpenCL kernels and letting them get recreated. In the past this has helped a couple of users.

Also, there are thumbnail settings in preferences. Have you changed any of those?

I’ve bumped the mandatory timeout to 1000 and run with -d common. Nothing looks different that I can see, but I haven’t examined the output very closely yet (and don’t really know what to look for anyway).

The only change I’ve made to the thumbnail settings is enabling “generate thumbnails in background”.

OpenCL is installed from RPM packages, which are up to date, so I’m not sure how to recreate the kernels other than by reinstalling the packages.
darktable.dcommon2.txt (188.4 KB)

When dt starts, it creates OpenCL kernels to use in the processing modules. It uses the current drivers to create them and stores them in ~/.cache/darktable. You can safely delete those folders, since darktable will check/regenerate them at startup. Every time Fedora updates the RPM Fusion driver package, new kernels will be generated.
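
In shell terms, something like this with darktable closed (clearing the whole cache directory is the blunt approach; it also discards the thumbnail cache, which gets rebuilt as well):

    # remove cached OpenCL kernels (and thumbnails); both are
    # checked/regenerated at the next darktable start
    rm -rf ~/.cache/darktable/*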

It looks from the latest -d common that most of the processing was in the GPU. Is it faster now?

The main thing I currently notice from your data is the lens module. There are multiple entries where the input ROI is smaller than the output:

     21.6517 modify roi IN              [thumbnail]      lens                   (   0/   0) 1349x 900 scale=0.2467 --> (   0/   0) 1350x 900 scale=0.2467

No, it’s still slow generating thumbnails and previews, and nvtop shows me it’s still pegging the CPU and not touching the GPU.

In lighttable, switching to culling view with four images and setting preview zoom to 100% takes anywhere from 45 to 70 seconds, all using the CPU.

What about bumping the event handles? The default is 128, I think; try 1024. You could also try the async pixelpipe; I think you have that disabled. Barring those two changes, I set my micro-nap to 0 and get away with that, and it did speed things up for me (see the sketch below for the corresponding darktablerc lines). Just trying to see what’s up, as you seem to shift out of GPU use from what I read by scanning your various log files.
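
A sketch of those three experiments as darktablerc lines (edit with darktable closed; the setting names match the earlier config dump, and note that async is considered a risky option, as the next reply points out):

    opencl_number_event_handles=1024
    opencl_async_pixelpipe=true
    opencl_micro_nap=0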

Iterations? Aren’t you confusing filmic rgb with diffuse or sharpen?

Todd, you are talking about tuning options, some of which (e.g. the async pipeline) are dangerous. Here, we should get OpenCL enabled (utilised) first.

We do see on-GPU processing, e.g.

23.3176 pixelpipe process CL       [thumbnail]      exposure               (   0/   0) 1350x 900 scale=0.2467 --> (   0/   0) 1350x 900 scale=0.2467 IOP_CS_RGB
23.3191 pixelpipe process CL       [thumbnail]      colorin                (   0/   0) 1350x 900 scale=0.2467 --> (   0/   0) 1350x 900 scale=0.2467 IOP_CS_RGB -> IOP_CS_LAB
23.3192 matrix conversion on GPU   [thumbnail]      colorin                (   0/   0) 1350x 900 scale=0.2467 --> (   0/   0) 1350x 900 scale=0.2467 `standard color matrix'

Maybe run with -d opencl instead of -d common to reduce the noise.

What version of the Nvidia driver are you using?
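
nvidia-smi can report that quickly, assuming the proprietary driver is installed:

    # print just the installed driver version
    nvidia-smi --query-gpu=driver_version --format=csv,noheader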

Try deleting cached kernels in ~/.cache/darktable.

I don’t think these are good suggestions.

Filmicrgb has iterations. They are on the last tab. I think it has to do with using the highlight reconstruction within filmic.

Ah yes, but highlight reconstruction is disabled by default.