Help needed to troubleshoot OpenCL tuning

So I have a new laptop that I’m preparing for darktable now. OK, it’s not a brand-new one, since my budget was somewhat limited :wink: It’s an ASUS Zenbook Pro 14 with some not-too-bad components inside that make me look forward to getting a usable RAW workflow ready:

  • Intel i5-8265U Whiskey Lake, 8GB RAM
  • GeForce GTX 1050 (Max-Q) with 4GB of GDDR5
  • Intel UHD 620, well OK
  • 512G SSD WDC PC SN520 SDAPNUW-512G-1002

I’m currently using R-Darktable, which I built on this laptop under Windows 11, exactly following the description by Aurélien. Basically, that build works fine.

But now here comes the problem that has me really stuck. When it came to performance tuning (by default, the UHD was used, not the GeForce :thinking:), I followed the instructions from the darktable docs and read tons of very good posts here on the forum (thanks, by the way, to all involved) to get my setup tuned.

At the moment, I’m really a little lost. With the current parameter settings I get very acceptable performance (given the system hardware) when exporting RAW to JPEG at full resolution from the lighttable. During export, the GeForce GPU is mostly near 100% and at no time is the CPU forced to take over the job. Well done.

After being quite happy at this point, I opened a RAW file in the darkroom for the first time. The GUI shows the picture, but also a “working…” indicator that never comes to an end. In the log (using darktable -d perf -d opencl) I get a repeating list of

[histogram] took 0,000 secs (0,000 CPU) scope draw

but there is no further hint that would help me get out of this stuck situation. So I’m asking here whether anyone can help me figure out what I need to tune further in the configuration. Any hints are appreciated; also, feel free to tell me which information you need for more insight. I didn’t want to dump all the information I have into this initial post, so as not to overload it.
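
For reference, this is how I start darktable from a Windows command prompt to capture the log; I can add more debug flags if that helps (assuming R-Darktable accepts the same -d options as mainline darktable, which I have not verified):

darktable -d opencl -d perf -d memory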

Thanks to all in advance - have a great and hopefully peaceful 1st Advent (if applicable to you :smiley:), bye
Lars.

Hi there, Lars!

I am not a Windowsian, so please bear with me…

First, what does the command darktable-cltest report?
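
On Windows you should be able to run it from a command prompt; I assume the binary sits next to darktable.exe in the install directory, so something like the following (your install path may differ):

"C:\Program Files\darktable\bin\darktable-cltest.exe"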

Best regards,
Claes in Lund, Sweden

Hi @Claes thanks for picking this up :smile: and no problem, it’s all just OSes :wink:

The output is as follows:

[dt_get_sysresource_level] switched to 1 as `default'
  total mem:       8043MB
  mipmap cache:    1005MB
  available mem:   7462MB
  singlebuff:      62MB
  OpenCL tune mem: WANTED
  OpenCL pinned:   OFF
[opencl_init] opencl related configuration options:
[opencl_init] opencl: ON
[opencl_init] opencl_scheduling_profile: 'very fast GPU'
[opencl_init] opencl_library: 'default path'
[opencl_init] opencl_device_priority: '*/!0,*/*/*'
[opencl_init] opencl_mandatory_timeout: 200
[opencl_init] opencl_synch_cache: false
[opencl_init] opencl library 'OpenCL.dll' found on your system and loaded
[opencl_init] found 2 platforms
[opencl_init] found 2 devices

[dt_opencl_device_init]
   DEVICE:                   0: 'NVIDIA GeForce GTX 1050 with Max-Q Design'
   CANONICAL NAME:           nvidiageforcegtx1050withmaxqdesign
   PLATFORM NAME & VENDOR:   NVIDIA CUDA, NVIDIA Corporation
   DRIVER VERSION:           526.98
   DEVICE VERSION:           OpenCL 3.0 CUDA, SM_20 SUPPORT
   DEVICE_TYPE:              GPU
   GLOBAL MEM SIZE:          4096 MB
   MAX MEM ALLOC:            1024 MB
   MAX IMAGE SIZE:           16384 x 32768
   MAX WORK GROUP SIZE:      1024
   MAX WORK ITEM DIMENSIONS: 3
   MAX WORK ITEM SIZES:      [ 1024 1024 64 ]
   ASYNC PIXELPIPE:          YES
   PINNED MEMORY TRANSFER:   NO
   MEMORY TUNING:            WANTED
   FORCED HEADROOM:          0
   AVOID ATOMICS:            NO
   MICRO NAP:                250
   ROUNDUP WIDTH:            16
   ROUNDUP HEIGHT:           16
   CHECK EVENT HANDLES:      1024
   PERFORMANCE:              0.783661
   DEFAULT DEVICE:           NO
   KERNEL DIRECTORY:         C:\Program Files\darktable\share\darktable\kernels
   CL COMPILER OPTION:       -cl-fast-relaxed-math
   KERNEL LOADING TIME:       0.0430 sec

[dt_opencl_device_init]
   DEVICE:                   1: 'Intel(R) UHD Graphics 620'
   CANONICAL NAME:           intelruhdgraphics620
   PLATFORM NAME & VENDOR:   Intel(R) OpenCL HD Graphics, Intel(R) Corporation
   DRIVER VERSION:           31.0.101.2114
   DEVICE VERSION:           OpenCL 3.0 NEO
   DEVICE_TYPE:              GPU
   GLOBAL MEM SIZE:          3217 MB
   MAX MEM ALLOC:            1609 MB
   MAX IMAGE SIZE:           16384 x 16384
   MAX WORK GROUP SIZE:      256
   MAX WORK ITEM DIMENSIONS: 3
   MAX WORK ITEM SIZES:      [ 256 256 256 ]
   ASYNC PIXELPIPE:          NO
   PINNED MEMORY TRANSFER:   NO
   MEMORY TUNING:            WANTED
   FORCED HEADROOM:          400
   AVOID ATOMICS:            NO
   MICRO NAP:                250
   ROUNDUP WIDTH:            16
   ROUNDUP HEIGHT:           16
   CHECK EVENT HANDLES:      1024
   PERFORMANCE:              1.634468
   DEFAULT DEVICE:           NO
   KERNEL DIRECTORY:         C:\Program Files\darktable\share\darktable\kernels
   CL COMPILER OPTION:       -cl-fast-relaxed-math
   KERNEL LOADING TIME:       0.0367 sec
[opencl_init] OpenCL successfully initialized.
[opencl_init] here are the internal numbers and names of OpenCL devices available to darktable:
[opencl_init]           0       'NVIDIA GeForce GTX 1050 with Max-Q Design'
[opencl_init]           1       'Intel(R) UHD Graphics 620'
[opencl_init] FINALLY: opencl is AVAILABLE on this system.
[opencl_init] initial status of opencl enabled flag is ON.
[dt_opencl_update_priorities] these are your device priorities:
[dt_opencl_update_priorities]           image   preview export  thumbs
[dt_opencl_update_priorities]           0       0       0       0
[dt_opencl_update_priorities]           1       1       1       1
[dt_opencl_update_priorities] show if opencl use is mandatory for a given pixelpipe:
[dt_opencl_update_priorities]           image   preview export  thumbs
[dt_opencl_update_priorities]           1       1       1       1
[opencl_synchronization_timeout] synchronization timeout set to 0

I don’t know what AP did with R-Darktable and Windows, so I can’t help too much. But the NVIDIA card has the forced headroom set to 0, and that can give you problems (it does for me). The default is 400 in darktable.

Hi @g-man, thanks for this information. I thought that setting this to 0 makes darktable determine the real amount of RAM on the GPU when running the pixelpipe for the first time.

Anyway, I have set it to 400 now, just for testing. This is the outcome when exporting four pictures in a row:

  • the first export was fine, everything running on the GPU
  • the second export threw an error message and switched over to the CPU:

235,005632 [dev_pixelpipe] took 0,087 secs (0,062 CPU) processed `color calibration’ on GPU, blended on GPU [export]
236,550680 [dt_opencl_enqueue_kernel_2d_with_local] kernel 178 on device 0: CL_MEM_OBJECT_ALLOCATION_FAILURE
244,743754 [opencl_diffuse] couldn’t enqueue kernel! -4
244,746175 [default_process_tiling_opencl_ptp] couldn’t run process_cl() for module ‘diffuse’ in tiling mode: CL_SUCCESS
244,746209 [opencl_pixelpipe] could not run module ‘diffuse’ on gpu. falling back to cpu path

After the export finished, I opened one of these images in the darkroom and am again stuck in this loop:

[histogram] took 0,000 secs (0,000 CPU) scope draw

:crazy_face:
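
To see whether the card really runs out of memory, I will also watch it with nvidia-smi during the next export (the tool ships with the NVIDIA driver; that it is on my PATH is an assumption, otherwise it lives somewhere under the driver install directory):

nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1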

What if you…

  • make a safety copy of your present darktable .config folder,
  • save it somewhere – because you may have to copy it back :slight_smile:
  • erase your present darktable .config folder, and
  • start darktable again? I.e. now using darktable’s default, basic settings?

That probably makes the UHD default again — but how does darktable react now?
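
On Windows the config folder should be %LOCALAPPDATA%\darktable; that location (and that R-Darktable uses the same one) is an assumption on my side. So roughly, from a Command Prompt with darktable closed:

rem keep the old config around; darktable will create a fresh one on the next start
ren "%LOCALAPPDATA%\darktable" darktable-backup

Renaming it back restores your old settings.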


I’m not sure what’s going on with the diffuse or sharpen module. Again, I don’t know what AP did with R-Darktable. Instead, try using darktable from master (the nightly builds for Windows), since it has more support and developers. At least this way you can rule out a card issue versus an R-Darktable issue.

Regarding the histogram, it is normal for it to run on the CPU. The important pipelines (full, export) use the GPU, and the less resource-intensive ones use the CPU.
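
That split is what the device priorities control. As a sketch only (syntax as per the manual; your build may write a different number of groups by default), a darktablerc line like

opencl_device_priority=0,*/!0,*/0,*/0,*

would prefer device 0 (your GTX 1050) for the full darkroom image, keep it away from the small preview, and use it again for export and thumbnails. That matches the image/preview/export/thumbs order printed by darktable-cltest above.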


Thanks @Claes and @g-man, I will try it and come back with what I find out (it may take some time).

Well, here comes the update: before going down the road of using official darktable (as @g-man suggested), I tried moving the config aside. Voilà, I had a reasonable starting point again, from which I then proceeded.

R-Darktable immediately detected and set “very fast GPU” and at the same time automatically deactivated the UHD 620 GPU. Then I fiddled with the parameters, and in the end I now have the following working GPU configuration:

cldevice_v4_nvidiageforcegtx1050withmaxqdesign=0 250 0 16 16 1024 0 0 0.037966
cldevice_v4_nvidiageforcegtx1050withmaxqdesign_id0=0

Additionally, I tweaked the resource settings as follows:

resource_default=950 8 128 975
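
If I read the startup header above correctly (mipmap cache 1005 MB is 128/1024 of my 8043 MB, singlebuff 62 MB is 8/1024, available mem 7462 MB is 950/1024), the four values in resource_default seem to be fractions of 1024:

  • 950/1024 of system RAM available for pixelpipe processing
  • 8/1024 of system RAM for the single buffer
  • 128/1024 of system RAM for the mipmap cache
  • 975/1024 of GPU memory usable for OpenCL (this last one is my guess; please check the memory and performance chapter of the manual)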

This gives me stable, fluid work during RAW processing, reasonable export times, and full use of the GPU memory. Perhaps I will try playing with the tiling parameters, but I’m not sure whether that will lead to any improvement. Setting asynchronous mode didn’t change anything regarding export times.

Again thanks so much @Claes and @g-man, you made my day :smile:.

Currently I have another new side effect: shortcuts are no longer working, e.g. typing Alt+1 shows the message “alt+1 not assigned”. Hm, any ideas on that?


No clue on the shortcuts. Maybe open an issue on the R-Darktable GitHub.


@Macchiato17: you would probably be better off with ‘core’ darktable. For details, see Future of Modules developed by AP in DT??? - #8 by jorismak. Any follow-up discussion related to the non-OpenCL aspect should be on that other thread.


(I don’t think R-Darktable and normal darktable have differences in OpenCL. Maybe in the auto-detection, but even then the parameters can be migrated.)

If your GPU is much quicker than your CPU, ‘very fast GPU’ seems logical to me. In my experience the Intel UHD (620) seems to help in heavy modules like diffuse or sharpen, but it can make the entire system laggy.

Affinity Photo, for example, moved more and more to GPU and OpenCL code. Then there was a major change where the program was suddenly slow as a snail on an Intel UHD. Turning off OpenCL made things way quicker. So there is a cutoff point where the memory transfers, or something else with the UHD 620, make performance worse instead of better.


There are some changes in dt 4.2.

On my system I have played around with a lot of the settings. I don’t have an integrated or second GPU, just an 8 GB 3060 Ti. For me, setting the micro nap to 0 made a massive improvement, and it seems stable to do so. I have possibly tweaked one or two other settings, but I use “large” for resources at the defaults and no tuning. I found that all other settings were slower than using “none” for tuning on my system. I left scheduling at the default, but I use “very fast GPU”, which I think takes over the scheduling. So for me that was it.

I had three images, each with its own XMP: one with lots of diffuse or sharpen and one with some other denoising. I would run each from a batch after changing the parameters around, and that is how I landed on the fastest combination: very fast GPU, tuning none, resources large and micro nap = 0. I may have bumped up the wait time before switching to the CPU slightly, but I think it is generous at around 2 seconds now; just going from memory on that one.

PS: Unrestricted was a tad slower than large.

My CPU is a 12600K and I have 32 GB of DDR5.

Try changing your micro nap to 0 and see if you notice a bump in performance.
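
For reference, the micro nap value sits in the per-device line in darktablerc. Judging from the darktable-cltest output earlier in this thread (MICRO NAP: 250 matches the second number of the cldevice_v4 line that was posted), it should be the second field, so something like the line below with the other numbers left as they were; please double-check against the manual before editing:

cldevice_v4_nvidiageforcegtx1050withmaxqdesign=0 0 0 16 16 1024 0 0 0.037966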

Yes, that observation is rather common :slight_smile: There seem to be some misunderstandings around: the “tuned” settings resource_default=950 8 128 975, or something like that, have been suggested here in this forum. By doing so you “try to get everything”. That seems to work as long as CPU/GPU don’t need that much, but if they hit their limits you might observe a) missed buffer allocations in OpenCL, thus falling back to the CPU (as mentioned above), or b) you take everything for module processing but then have to swap other apps or the dt iop cache, which is far worse than an occasional round of tiling. So mostly “large” is a better setting than “unrestricted” or trying to “tune”. BTW, there is a chapter in the manual explaining this. But do as you personally like; I would just never suggest such default settings to others.

A micro nap of zero is pretty safe with good cards, or if the CPU drives the desktop. With slow GPU cards dt will be marginally slower, but your system desktop might have problems updating the screen and thus feel “laggy”.

Thanks for your comments. For sure, everyone is going to have to experiment with their own hardware to assess performance vs. stability. I set the micro nap to zero and noticed a big improvement, expecting to have to go back to the default if I had issues. It has been a few months and it seems fine for me. Otherwise it’s pretty standard for me: “very fast GPU” and “large” for resources seemed to be about the best I could do.

Thanks for your suggestions as well. I found that setting the micro nap to zero didn’t change anything really measurable on my system. But since it didn’t “crash” the workflow, I simply left it at zero, knowing it might be the first setting to change back if the system gets weird :grin:

These are the files I compare to assess the tweaks: a couple of Play Raw files and a phone image.

The command line batch is also provided.

I use an alternate config directory for testing. If you modify the command line, you can also disable OpenCL to compare against it being on; a sketch of such a command follows below the attachment.
bench.SRW.zip (76.7 MB)
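
The batch boils down to commands along these lines; the file names and config directory are placeholders, and I’m quoting the standard darktable-cli options here rather than my exact batch file:

darktable-cli bench.SRW bench.SRW.xmp out.jpg --core --configdir C:\dt-test -d perf
darktable-cli bench.SRW bench.SRW.xmp out-cpu.jpg --core --configdir C:\dt-test --disable-opencl -d perf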

Edit: just tried with the micro nap set to 0, as I normally have it, and the processing time was 14.1 seconds for my Nikon test image. With it at the default it was over 1.5 seconds slower, so not a bad little bump.

The gain for the Sony file and XMP was the smallest, 6.454 vs. 6.1 seconds; for the bench file it was 4.332 vs. 2.9 seconds.

So for these examples the net gains may be fairly small, but it’s still a gain, and as a percentage not too bad.

One thing I don’t understand is that the manual says, “The configuration parameter “opencl_device_priority” holds a string with the following structure: a,b,c.../k,l,m.../o,p,q.../x,y,z.... Each letter represents one specific OpenCL device. There are four fields in the parameter string separated by a slash, each representing one type of pixelpipe.”

But in my darktablerc I see five:
opencl_device_priority=0,*/!0,*/0,*/0,*/!0,*
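
Reading that value with the manual’s own syntax (and the image/preview/export/thumbs order printed by darktable-cltest above), I get:

opencl_device_priority=0,*/!0,*/0,*/0,*/!0,*

  • image (full darkroom): 0,* = device 0 first, then any other
  • preview: !0,* = any device except 0
  • export: 0,*
  • thumbnails: 0,*
  • fifth group: !0,* (presumably a pixelpipe type added after that manual section was written; that part is just my guess)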