darktable 3.4/3.5 opencl slow on Windows 10

I have a Win10 laptop and 3.4.1 with these specs:

i7-8565u, Nvidia Geforce MX250 2gb (latest driver), Intel UHD 620, 32gb ram, 512gb SSD, 1920x1200 monitor

I opened up a 20mp Olympus PEN-F raw file and used denoise (profiled) using your settings. Even with my low power MX250 the screen update only takes 1-2 seconds.



Good luck. I hope you can find the solution to the problem you are having.

Should you be on the default OPENGL setting in DT, or would the faster graphics settings work better? I still don’t think this is the problem: I have a crappy older NVIDIA card that uses OPENGL with ON1 Photo RAW and it seems to work fine. I think you should maybe try, as @kofa says, to benchmark it and see whether the numbers seem okay for the hardware as it is configured: GpuTest - Cross-Platform GPU Stress Test and OpenGL Benchmark for Windows, Linux and OS X | Geeks3D.com

Also, are there any settings that might be needed in the BIOS? You might just want to review those to the best of your ability in case there is something weird there, like the cache being disabled or something stupid… just grasping at straws, but you never know.

EDIT… maybe see if you have a BIOS update??

Any chance your motherboard has an onboard GPU?? If so, maybe make sure it is disabled. If your system benchmarks without DT involved are slow, maybe review your BIOS settings one by one, and if there is a BIOS update, maybe give that a try. Hope you find the problem… must be very frustrating.

NVidia also provides some tests specifically for OpenCL.

If those run fine you can at least exclude that your issue is hardware / bios / driver related.


Guys, I am telling you this is not a problem that just I have. The new Nvidia drivers are broken. But it is difficult to notice it if you don’t have both Linux and Windows.
What I am trying right now: I am downloading the oldest driver that is still available from Nvidia (457/451). Let’s see if that one is broken too.

I also applied some other modules since I had to create some noise to remove. I’d say 2 seconds for denoise only is quite slow.

When you use your MX250 how long does it take?

On Windows, I think I get the same performance as you 1.something seconds if only basecurve + denoise non local auto are active.
I think there is no significant difference between with and without opencl.
I am about to download msys2 so I can actually measure the performance on Windows.
On Debian, dt seems to be a bit faster, but there is no difference between opencl and no opencl either. Actually according to darktable -d perf, without opencl it is even slightly faster, I measured 0.9 seconds. With opencl seems to be 1.1 seconds.

Edit: On Windows, 2.0 seconds with opencl and 1.5 seconds without opencl (according to darktable -d perf).

I just did some performance tests with my systems, too. I do have two Linux-Systems running the same darktable versions here: an old one (I7-4700HQ with Geforce GT 750M) and a newer one (I7-7820HQ with Quadro M1200) and I can observe similar behaviours, too.
The 4700HQ system is faster with opencl disabled while the 7820HQ system is faster with opencl enabled.

I took an example image and ran the export from darktable with different settings:
4700HQ-GPU-enabled: pixel pipeline processing took 46.309 secs (59.258 CPU)
4700HQ-GPU-disabled: pixel pipeline processing took 29.932 secs (222.965 CPU)
so the 4700HQ system is roughly 50% faster when not using the GPU

while
7820HQ-GPU-enabled: pixel pipeline processing took 12,010 secs (20,289 CPU)
7820HQ-GPU-disabled: pixel pipeline processing took 21,378 secs (162,917 CPU)
so the 7820HQ system is almost double the speed having the GPU enabled
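
For anyone who wants to compare their own numbers the same way, here is a minimal sketch that pulls the wall-clock time out of the "pixel pipeline processing took ..." lines quoted above and computes the ratio. The function names are made up for illustration, and it assumes the line format is exactly as pasted in this post (decimal point or decimal comma).

```python
import re

# Matches e.g. "pixel pipeline processing took 46.309 secs (59.258 CPU)".
# Decimal commas ("12,010") are accepted too, as in the second pair of lines.
TIMING = re.compile(r"took\s+(\d+[.,]\d+)\s+secs\s+\((\d+[.,]\d+)\s+CPU\)")


def parse_timing(line):
    """Return (wall_seconds, cpu_seconds) from one timing line."""
    match = TIMING.search(line)
    if match is None:
        raise ValueError(f"no timing found in: {line!r}")
    wall, cpu = (float(value.replace(",", ".")) for value in match.groups())
    return wall, cpu


def speedup(slow_line, fast_line):
    """How many times faster the run in fast_line was (wall-clock)."""
    return parse_timing(slow_line)[0] / parse_timing(fast_line)[0]


old_gpu = "4700HQ-GPU-enabled: pixel pipeline processing took 46.309 secs (59.258 CPU)"
old_cpu = "4700HQ-GPU-disabled: pixel pipeline processing took 29.932 secs (222.965 CPU)"
new_gpu = "7820HQ-GPU-enabled: pixel pipeline processing took 12,010 secs (20,289 CPU)"
new_cpu = "7820HQ-GPU-disabled: pixel pipeline processing took 21,378 secs (162,917 CPU)"

print(f"4700HQ: CPU-only is {speedup(old_gpu, old_cpu):.2f}x as fast as with GPU")
print(f"7820HQ: GPU is {speedup(new_cpu, new_gpu):.2f}x as fast as CPU-only")
```

With the four lines above this gives about 1.55x for the 4700HQ without the GPU and about 1.78x for the 7820HQ with the GPU.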

Nevertheless, denoise took roughly 2/3 of the processing time on all systems, regardless of whether the GPU was used.

Looks like the relationship between CPU and GPU performance is very important here. If you have a fast CPU but a low-end to mid-range GPU, enabling the GPU does not help much in processing; on the contrary, it might even run slower.

@betazoid your CPU is just so fast that the GPU does not give you an additional boost in performance. The only chance I see is to distribute CPU/GPU power for calculating preview and full image as @MStraeten already suggested.


I don’t know anything about how nvidia cl stuff works on windows. Can only speak for darktable on linux.

There have been quite a number of performance gains in current master, especially if you are using the release version, as that uses -O3, which vectorises much better and leads to a performance gain of up to 50% for some modules. That depends a lot on the CPU hardware you have, but in general dt's CPU modules got much faster.

For OpenCL this has not changed: some modules are very good on OpenCL, some are not. In general, the performance of some modules' OpenCL code depends heavily on the graphics card's memory transfer speed. The profiled denoise is the best example. So a 750M card will likely not be faster for that module than a current CPU; the Quadro M1200 is faster but also on a not-so-fast memory bus.

If you have a dedicated graphics card with fast ram & 256bit wide bus, the story will be very different.

In general, if you don’t have an exceptionally fast graphics card, you will likely be better off distributing load between CPU and GPU as @MStraeten suggested.


I noticed that there were performance optimizations in dt 3.4 and 3.6. When I bought my laptop, the stable version was 2.6 I think. At that time, almost everything was better on the GPU. I think meanwhile there are more modules that are (heavily) multithreaded if they use the CPU.
So dt 3.4 and 3.6 performance is not so terrible any more without a good GPU.
Well, time flies…
Anyway, dt 3.5 is quite fast with opencl on Linux on my new desktop PC. In general, it needs about 1/3 of the time that it needs with CPU-only processing. It’s most noticeable when I add more modules.
But there is a performance problem on Windows. Yesterday, when I tested on my laptop that had an old Nvidia driver installed, the performance seemed to be ok.
The performance results seem to be different and sometimes confusing on my laptop, but one thing is sure: performance is significantly better on Linux, with and without opencl.


You may want to try the following:

  1. stop darktable
  2. remove opencl_scheduling_profile from your .config/darktable/darktablerc
  3. restart darktable
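
If you would rather not edit the file by hand, step 2 can be scripted. This is only a sketch: it assumes darktablerc is a plain text file with one key=value entry per line, the helper names are made up, and it writes a .bak copy next to the file first. Run it only while darktable is stopped.

```python
from pathlib import Path

KEY = "opencl_scheduling_profile"


def drop_key(text, key=KEY):
    """Return text without any 'key=value' line for the given key."""
    kept = [line for line in text.splitlines(keepends=True)
            if not line.startswith(key + "=")]
    return "".join(kept)


def reset_scheduling_profile(rc_path):
    """Back up rc_path to <name>.bak, then rewrite it without the key."""
    rc_path = Path(rc_path)
    original = rc_path.read_text(encoding="utf-8")
    backup = rc_path.with_name(rc_path.name + ".bak")
    backup.write_text(original, encoding="utf-8")
    rc_path.write_text(drop_key(original), encoding="utf-8")


# Example call for the Linux location mentioned above (darktable must be stopped):
# reset_scheduling_profile(Path.home() / ".config" / "darktable" / "darktablerc")
```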

And/or set it manually:


Does anybody know a filter in Gimp that is suited for performance testing? Afaik Gimp does use opencl.

Have you been able to just benchmark OpenCL itself… not GIMP, DT, whatever, just to be sure that you're starting with a baseline before attempting to troubleshoot its implementation in the software…

I found this: LuxMark v3 - LuxCoreRender Wiki
Looks like it would work on your system and give you an idea.

Found it here

Sorry if this is not of any use…


So I ran Geekbench on both Windows and Linux, the results are quite similar. So apparently Opencl is not broken on Windows?

This is Linux:

And this Windows:

The LuxMark results on Windows are 17041 and 16879. On Linux I also ran it approximately twice, but I only wrote down the result once: 16420 (the first result was also 16something).
So it really looks like there is nothing wrong with Opencl on Windows.

Are there different versions of opengl? Could you have one that is new and not supported or recognized by the software? I have no idea, but others will likely understand better. I think I saw something on a site saying that your card supports v 4.63?? or something… Hopefully someone can help, if that is even a possibility??


opengl version is 4.6. What has opengl to do with opencl?
As mentioned above, I also tried to use a version of darktable that I compiled specifically for this system, shouldn’t that help with this?

Edit: On Debian, opengl version is also 4.6.

Just a typo sorry

OpenCL kernels are always compiled locally, when you start darktable (and it does not find the compiled versions matching your current video driver). You should see that if you run with the -d opencl command-line argument, but only if you first delete the already compiled versions (on Linux: ~/.cache/darktable/cached_kernels_for_{name of device}_{version of driver}, e.g. for me: cached_kernels_for_GeForceGTX10606GB_46080).

Output with compilation:

0.039812 [opencl_init] opencl related configuration options:
0.039828 [opencl_init] 
0.039831 [opencl_init] opencl: 1
0.039836 [opencl_init] opencl_scheduling_profile: 'default'
0.039840 [opencl_init] opencl_library: ''
0.039843 [opencl_init] opencl_memory_requirement: 768
0.039847 [opencl_init] opencl_memory_headroom: 400
0.039849 [opencl_init] opencl_device_priority: '*/!0,*/*/*/!0,*'
0.039853 [opencl_init] opencl_mandatory_timeout: 200
0.039856 [opencl_init] opencl_size_roundup: 16
0.039858 [opencl_init] opencl_async_pixelpipe: 1
0.039860 [opencl_init] opencl_synch_cache: active module
0.039865 [opencl_init] opencl_number_event_handles: 1000
0.039868 [opencl_init] opencl_micro_nap: 0
0.039871 [opencl_init] opencl_use_pinned_memory: 0
0.039873 [opencl_init] opencl_use_cpu_devices: 0
0.039876 [opencl_init] opencl_avoid_atomics: 0
0.039878 [opencl_init] 
0.040083 [opencl_init] found opencl runtime library 'libOpenCL'
0.040105 [opencl_init] opencl library 'libOpenCL' found on your system and loaded
0.067237 [opencl_init] found 1 platform
0.067246 [opencl_init] found 1 device
0.067456 [opencl_init] device 0 `GeForce GTX 1060 6GB' has sm_20 support.
0.067616 [opencl_init] device 0 `GeForce GTX 1060 6GB' supports image sizes of 16384 x 32768
0.067620 [opencl_init] device 0 `GeForce GTX 1060 6GB' allows GPU memory allocations of up to 1519MB
[opencl_init] device 0: GeForce GTX 1060 6GB 
     GLOBAL_MEM_SIZE:          6077MB
     MAX_WORK_GROUP_SIZE:      1024
     MAX_WORK_ITEM_DIMENSIONS: 3
     MAX_WORK_ITEM_SIZES:      [ 1024 1024 64 ]
     DRIVER_VERSION:           460.80
     DEVICE_VERSION:           OpenCL 1.2 CUDA
0.124780 [opencl_init] options for OpenCL compiler: -w -cl-fast-relaxed-math  -DNVIDIA_SM_20=1 -DNVIDIA=1 -I"/home/kofa/darktable-master/share/darktable/kernels"
0.124873 [opencl_init] compiling program `demosaic_ppg.cl' ..
0.124915 [opencl_fopen_stat] could not open file `/home/kofa/.cache/darktable/cached_kernels_for_GeForceGTX10606GB_46080/demosaic_ppg.cl.bin'!
0.124919 [opencl_load_program] could not load cached binary program, trying to compile source
0.124922 [opencl_load_program] successfully loaded program from '/home/kofa/darktable-master/share/darktable/kernels/demosaic_ppg.cl' MD5: '873aa05f976ebda5de7eee1601037421'
0.127093 [opencl_build_program] successfully built program
0.127098 [opencl_build_program] BUILD STATUS: 0
0.127100 BUILD LOG:
0.127100 

0.127104 [opencl_build_program] saving binary
... repeated for the other kernels ...
0.160316 [opencl_init] kernel loading time: 0.0355 
0.160321 [opencl_init] OpenCL successfully initialized.
0.160323 [opencl_init] here are the internal numbers and names of OpenCL devices available to darktable:
0.160324 [opencl_init]          0       'GeForce GTX 1060 6GB'
0.160326 [opencl_init] FINALLY: opencl is AVAILABLE on this system.
0.160327 [opencl_init] initial status of opencl enabled flag is ON.
0.160335 [opencl_create_kernel] successfully loaded kernel `blendop_mask_Lab' (0) for device 0
0.160339 [opencl_create_kernel] successfully loaded kernel `blendop_mask_RAW' (1) for device 0
0.160343 [opencl_create_kernel] successfully loaded kernel `blendop_mask_rgb_hsl' (2) for device 0
... repeated for the other kernels ...

If you already have them, you won’t have the compilation step, only the loading messages.
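
To force that compilation step again, the cached kernel directories have to go first. A small sketch for finding (and optionally deleting) them, assuming the Linux cache location and the cached_kernels_for_ directory prefix shown above; the function names are made up, and by default it only prints what it would remove:

```python
import shutil
from pathlib import Path


def kernel_cache_dirs(cache_root):
    """List the cached_kernels_for_* directories under cache_root."""
    root = Path(cache_root)
    if not root.is_dir():
        return []
    return sorted(p for p in root.glob("cached_kernels_for_*") if p.is_dir())


def clear_kernel_caches(cache_root, dry_run=True):
    """Print each kernel cache directory; delete it only if dry_run is False."""
    dirs = kernel_cache_dirs(cache_root)
    for directory in dirs:
        print(("would remove " if dry_run else "removing ") + str(directory))
        if not dry_run:
            shutil.rmtree(directory)
    return dirs


# Example (Linux path from the post above), dry run only:
# clear_kernel_caches(Path.home() / ".cache" / "darktable")
```

After removing the directories, starting darktable with -d opencl should show the "compiling program" lines again.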


Exactly. And there is only one cache for these kernels per video driver. So if you change the video driver, recompilation is enforced. But if you recompile dt and the new dt requires some updated kernels, you will run into problems that show up with the -d opencl option. If there is a kernel mismatch, you will end up with either visual bugs or a fallback to CPU code, and the fallback takes time again. BTW the installers make sure the opencl kernels are valid, but if you compile yourself on current master …
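
If you save the -d opencl output to a file, a rough way to look for trouble is to check the BUILD STATUS reported after each "compiling program" line. This sketch relies only on the two message formats visible in the log above, and on the assumption that the number printed is the raw OpenCL build status, where 0 means success. It would not catch a stale cached binary that loads without being rebuilt. The function names are made up.

```python
import re

# Message formats as they appear in the -d opencl output quoted in this thread.
COMPILING = re.compile(r"\[opencl_init\] compiling program `([^']+)'")
STATUS = re.compile(r"\[opencl_build_program\] BUILD STATUS: (-?\d+)")


def build_statuses(log_lines):
    """Map each compiled program to the BUILD STATUS reported after it."""
    statuses = {}
    current = None
    for line in log_lines:
        match = COMPILING.search(line)
        if match:
            current = match.group(1)
            continue
        match = STATUS.search(line)
        if match and current is not None:
            statuses[current] = int(match.group(1))
            current = None
    return statuses


def failed_builds(log_lines):
    """Programs whose reported build status is not 0 (assumed: 0 = success)."""
    return [name for name, status in build_statuses(log_lines).items()
            if status != 0]


# Example: failed_builds(open("opencl.log", encoding="utf-8"))
```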