Darktable OpenCL performance with Nvidia GTX 1650 Super

Hello all,

I would like to ask for your help with some issues I’m observing on my new PC build. I built this PC specifically for Darktable, and I’m trying to find the best GPU for OpenCL within my budget.

Software-wise, my setup is: Linux Mint 20, Darktable 3.2.1

At first I bought an AMD RX 570 8GB and got pretty decent results compared with CPU-only processing. I used the open-source AMD drivers, with the necessary OpenCL part installed from the AMDGPU-PRO package.
Just a few days after buying the AMD card, I found an Nvidia GTX 1650 Super on sale. On paper, the 1650 Super should be the better-performing GPU, although it has only 4GB of VRAM instead of 8GB - but from profiling Darktable exports with the AMD GPU, VRAM consumption ranged from a few hundred MB up to 2.4GB, so I figured 4GB of VRAM would be OK. So I decided to buy that one as well, keep the better-performing card and return the other one.

I set up a basic comparison of the cards: I selected 30 edited photos and used darktable-cli to export them while logging the debug and profiling info, watching the GPU resources, and measuring the total duration of the whole operation.
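Roughly, the comparison looked like this (just a sketch - the test set path, file extension and output names are placeholders, and the darktable-cli options should be double-checked against your version):

# export the same 30 edited photos with debug/profiling output, timing the whole run
time (for raw in testset/*.CR2; do
  darktable-cli "$raw" "out/$(basename "${raw%.CR2}").jpg" --core -d opencl -d perf
done) > export.log 2>&1

# in a second terminal, watch VRAM usage during the run
watch -n 1 nvidia-smi    # (or radeontop for the AMD card)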

With the Nvidia GPU, I installed the proprietary nvidia-driver-440. OpenCL support worked out of the box. However, the results of my tests surprised me: the same set of exports took approx. 30% longer than on the older and slower AMD card. Also, from the log I can see that the Nvidia card only allows VRAM allocations of up to 977MB. And, from the benchmarks that can be found here (GPU benchmarks in darktable), it seems that Nvidia GPUs often allow only approx. 1/4 of their VRAM for OpenCL memory allocations.

Has anyone here observed such low performance with Nvidia GPUs and Darktable on Linux? Do you have any recommendations on how to improve it, or maybe how to trick the driver into allowing larger memory allocations?

Below are the initial parts of the Darktable logs:

AMD:

[opencl_init] opencl library 'libOpenCL.so.1' found on your system and loaded
[opencl_init] found 1 platform
[opencl_init] found 1 device
[opencl_init] device 0 `Ellesmere' supports image sizes of 16384 x 16384
[opencl_init] device 0 `Ellesmere' allows GPU memory allocations of up to 6565MB
[opencl_init] device 0: Ellesmere 
     GLOBAL_MEM_SIZE:          7950MB
     MAX_WORK_GROUP_SIZE:      256
     MAX_WORK_ITEM_DIMENSIONS: 3
     MAX_WORK_ITEM_SIZES:      [ 1024 1024 1024 ]
     DRIVER_VERSION:           3143.9
     DEVICE_VERSION:           OpenCL 1.2 AMD-APP (3143.9)
[opencl_init] options for OpenCL compiler: -w  -DAMD=1 -I"/usr/share/darktable/kernels"

Nvidia:

[opencl_init] opencl library 'libOpenCL.so.1' found on your system and loaded
[opencl_init] found 1 platform
[opencl_init] found 1 device
[opencl_init] device 0 `GeForce GTX 1650 SUPER' has sm_20 support.
[opencl_init] device 0 `GeForce GTX 1650 SUPER' supports image sizes of 32768 x 32768
[opencl_init] device 0 `GeForce GTX 1650 SUPER' allows GPU memory allocations of up to 977MB
[opencl_init] device 0: GeForce GTX 1650 SUPER 
     GLOBAL_MEM_SIZE:          3909MB
     MAX_WORK_GROUP_SIZE:      1024
     MAX_WORK_ITEM_DIMENSIONS: 3
     MAX_WORK_ITEM_SIZES:      [ 1024 1024 64 ]
     DRIVER_VERSION:           440.100
     DEVICE_VERSION:           OpenCL 1.2 CUDA
[opencl_init] options for OpenCL compiler: -w  -DNVIDIA_SM_20=1 -DNVIDIA=1 -I"/usr/share/darktable/kernels"

I’m no expert, but my new PC has an RTX 2060 with 6 GB. OpenCL did not work with the default nouveau driver. When I installed the latest proprietary Nvidia driver, darktable detected the OpenCL capability without my doing anything else.

Now, when pixel pushing on a large image file, the operations that make darktable say “waiting…” are finished in the blink of an eye. :heart_eyes:

Does OpenCL - with whichever GPU - have anything to do with exports? I thought not. I thought the GPU was all about faster display updates when pushing sliders left and right, not about file saving.

I’d love to be wrong about that. I’m shopping for a new graphics card (Linux Mint).

Yes, OpenCL speeds up exports as well. You can see for yourself by starting darktable with darktable -d perf.
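A quick way to see it is to export the same file twice from a terminal and compare the wall times, once with and once without OpenCL (a sketch; the file names are placeholders):

time darktable-cli test.CR2 test_gpu.jpg --core -d perf
time darktable-cli test.CR2 test_cpu.jpg --core --disable-opencl -d perf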

I have the same card and am seeing the same memory for OpenCL being reported.

I’m using a GeForce RTX 2060 SUPER.
My darktable log also shows that I’m not using the full amount of the card’s memory:

[opencl_init] found 1 device
[opencl_init] device 0 `GeForce RTX 2060 SUPER' has sm_20 support.
[opencl_init] device 0 `GeForce RTX 2060 SUPER' supports image sizes of 32768 x 32768
[opencl_init] device 0 `GeForce RTX 2060 SUPER' allows GPU memory allocations of up to 1994MB
[opencl_init] device 0: GeForce RTX 2060 SUPER 
     GLOBAL_MEM_SIZE:          7979MB
     MAX_WORK_GROUP_SIZE:      1024
     MAX_WORK_ITEM_DIMENSIONS: 3
     MAX_WORK_ITEM_SIZES:      [ 1024 1024 64 ]
     DRIVER_VERSION:           440.100
     DEVICE_VERSION:           OpenCL 1.2 CUDA

Maybe there are specific options in darktablerc that need to be changed?

Just some background info: GPUs have immense computing power (thanks to their ability to execute many operations concurrently), which can be utilised not only to output to the display, but to do all kinds of calculations. The term ‘GPGPU’ stands for ‘general-purpose computing on GPUs’; in fact, there are GPUs that offer no video output at all and are used purely for computing. The use of GPUs for mining Bitcoin (and other cryptocurrencies) was the reason for high GPU prices in the recent past.

@pittendrigh: yes, as noted here and as Frank said, OpenCL also has an impact on export speed. The approach in the link is the most reliable measurement of OpenCL performance I know of, but if anybody can suggest a different method, I’d be happy to try it.
@wiegemalt: there are some options in Darktable that allow tuning of OpenCL (here and here). I played with a few of them in the past but did not see any noticeable differences. Also, since the amount of memory available for allocations is printed by opencl_init, I would guess it has nothing to do with Darktable itself - to me it sounds like the limit sits somewhere between the top-level OpenCL library (in my case libOpenCL.so.1) and the hardware, i.e. somewhere in the drivers.
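For reference, the OpenCL-related keys darktable is currently using can be listed straight from the config file (assuming the default config location):

grep '^opencl' ~/.config/darktable/darktablerc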

To me it still remains a mystery why so little VRAM is available for allocations with Nvidia cards…

But is that “allows GPU memory allocations of up to” the maximum per allocation, or the sum of all allocations?

@rvietor That is a good question. According to this post, it should be “the upper size limit for any single specific allocation”. So that doesn’t sound so bad…
Nevertheless - even if this is not the limiting factor in performance, it is weird that the GTX 1650 Super performs significantly slower than the RX 570… could anyone suggest how to get more performance from the Nvidia GPU with Darktable?
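For what it’s worth, this can be checked outside of darktable too: the reported limit corresponds to the OpenCL device property CL_DEVICE_MAX_MEM_ALLOC_SIZE (a per-allocation limit), as opposed to CL_DEVICE_GLOBAL_MEM_SIZE. The OpenCL spec only requires the former to be at least a quarter of the latter, and Nvidia’s driver appears to report exactly that minimum (977MB is 3909MB / 4). With clinfo installed, both values can be queried directly (the exact labels may differ between clinfo versions):

clinfo | grep -iE 'max memory allocation|global memory size'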

The thing is that according to Geekbench’s OpenCL benchmarks, the GTX 1650 Super should perform more than 20% better than the RX 570. So either the Nvidia Linux drivers suck hard or I’m doing something wrong.

There is one more thing that might point to the issue - in the logs I posted earlier, the Nvidia card reports
MAX_WORK_ITEM_SIZES: [ 1024 1024 64 ], while the AMD card reports
MAX_WORK_ITEM_SIZES: [ 1024 1024 1024 ]… could that be a reason? Or does it relate directly to the memory allocations discussed above?

I have double-checked the Darktable logs to see whether any fallback to CPU processing is mentioned (which is what Darktable does in case of OpenCL issues, and which would explain the slow operations). But that’s not the case; everything seems fine - only the display encoding operation is performed on the CPU, and that is also the case with the AMD GPU.
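For anyone wanting to reproduce that check, I simply grepped the export log for fallback messages (a sketch; the exact wording of darktable’s messages may vary between versions):

grep -iE 'fallback|falling back' export.log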

I also tried downgrading the drivers to 435, 430 and 390, and saw no improvement…

You have to tweak the two darktablerc files correctly. My card uses more than 80% of its memory for OpenCL.

Which card and which tweaks?

AMD or Nvidia. There is a darktablerc - the configuration file for darktable - in .config, and another one in the filesystem. https://www.darktable.org/usermanual/en/darktable_and_opencl_amd.html

Your link discusses OpenCL settings for AMD cards. This thread is discussing the behavior of a specific Nvidia card, the 1650 Super. Are you saying that the settings for AMD in your link will increase the OpenCL memory used by the 1650 Super?

The link is to the darktable user manual; if you go down the page, you will see the next chapter, about “OpenCL performance optimization”…

Ah, sorry, I didn’t realize that you meant it as a generic link to the manual. I’m familiar with the manual. I’m also familiar with the OpenCL optimizations, as I’m a long-time user of darktable. However, none of the changes I make seem to affect the reported memory used. Perhaps I’m reading the recommendations in the manual a bit too literally. However, since you seem to have found the correct optimizations to allow 80% OpenCL memory usage with the 1650 Super, I’m wondering if you might share them. Thanks.

I have a GTX 1050 Ti 4GB. My tweaks follow the user manual:
opencl_async_pixelpipe=true
opencl_avoid_atomics=false
opencl_mandatory_timeout=200
opencl_memory_headroom=1024
opencl_memory_requirement=2048
opencl_micro_nap=10
opencl_number_event_handles=1000
opencl_scheduling_profile=very fast GPU
opencl_size_roundup=16
opencl_synch_cache=false
opencl_use_cpu_devices=false
opencl_use_pinned_memory=false
opencl_memory_requirement=2048
I tweak the two darktablerc files. Each time you upgrade, reinstall or compile darktable, you have to tweak the two darktablerc files again. Only the darktablerc in …/.config/darktable has to be tweaked.
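One caveat when applying these tweaks: darktable writes its settings back to darktablerc when it exits, so close darktable before editing, otherwise your edits will be overwritten. For example (assuming the default config path, with one of the keys from the list above):

# close darktable first, then edit the user config
sed -i 's/^opencl_scheduling_profile=.*/opencl_scheduling_profile=very fast GPU/' ~/.config/darktable/darktablerc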

Thanks! I appreciate it. I’ll give these settings a go and see how it turns out.

Interesting - I thought that darktablerc settings (I read the manual in the past) could affect how Darktable initializes and treats the “top layer” of the OpenCL API, but could not affect the amount of memory which OpenCL offers for processing - I would think that allowing this limit to be changed via a public interface would be neither safe nor reasonable. I may be wrong, though. @Frank_Lepore, could you please report back whether you successfully change the amount of memory available?
Also, please note that in the tweaks from @dim, “opencl_memory_requirement” is set twice, which is not necessary and may be confusing.

I have to admit that I gave up and went back to the Radeon GPU - simply because the Nvidia card was approx. 30% slower in my tests (and it’s not certain that allowing larger memory allocations would improve the performance). But I’m still curious whether one can change the amount of memory available.
