OpenCL, multiple GPUs, memory and tuning

Thank you guys for all your input.
Based on this input I have deduced the following about tuning darktable (some of it is specific to my PC):

  1. Make sure that as much as possible is processed on the NVIDIA GPU. That means turning on OpenCL and preventing the Intel GPU from being used.
  2. There are enough resources to edit images like the demo image effectively (no tiling). The speed of the GPU is the limiting factor.
  3. Exporting is very hard on resources and runs for a very long time if processing on the NVIDIA GPU is not possible (due to lack of memory). Even tiling slows down the export a lot.
  4. The NVIDIA card is seldom used in everyday situations (only when processing video, Photoshop, Norton security scanning, darktable and maybe a few other special cases).

So:
The default OpenCL setting is not ideal in this situation, since it explicitly prevents the NVIDIA GPU from processing the preview.
There are a number of ways to keep the Intel GPU from being used. It can be removed from the system or deactivated in the BIOS, with possible consequences for battery life and other side effects.
Use of the Intel GPU can also be prevented with several different settings in darktable. I'm not sure whether any one method is clearly superior to the others. It seems that darktable, left to itself, selects the NVIDIA card if allowed.

Since the NVIDIA GPU is only used in special situations, it seems safe to try to use the maximum amount of memory. I tried the following settings: OpenCL performance = "memory size" in the presets, and forced headroom = 1 in the darktablerc.

This caused the export time to drop from 79 to 59 sec., with no tiling except for the "diffuse" module. Processing "diffuse" dropped from 64 to 45 sec.
Exporting really appreciates plenty of GPU memory.

It just seems odd that your export takes so long. Before replacing it with the RTX 2060 6GB, my old GTX 1050 2GB card was able to export a 16MB NEF to a 16-bit TIFF in just under 5 sec.

I wonder if there is something else on your system that's bottlenecking darktable's performance.

Begging for trouble :slight_smile:

Any ideas on what and how to investigate?

We are going in circles. That's why I suggested using the "very fast GPU" scheduler :slight_smile:

You will certainly run into problems doing so. The CL code will abort due to stressing memory too much, falling back to the CPU.

Don't know how to help any further…

Is the OS set for power saving? There are ways to ask for high performance… not sure how much it would help, but you can specify that in the power and GPU/display settings to be sure it is being used… a user of ON1 Photo RAW also pointed out recently that this boosted performance dramatically…

It will drain your battery again, but there is no free lunch when you want to go faster…

I posted the screenshots above… select DT and set it to run at high performance…

The power scheme could be edited as well… I think there is one you can enable called Ultimate Performance… this should let things run as fast as possible but will, as mentioned, cut into battery life…
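If the Ultimate Performance plan isn't visible on your machine, it can usually be unlocked from an elevated command prompt; this is the standard Microsoft GUID for that scheme, as far as I know:

powercfg -duplicatescheme e9a42b02-d5df-448d-aa00-03f14749eb61

After that it should show up alongside the other plans in the power options…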

I think there are also settings in the NVIDIA control panel… something called Battery Boost slows the card, and I think there is a setting to prefer high performance as well… not sure how these interact with the OS settings…

@obe, isn't this the latest driver for your card? The link came up when I put your card in the search criteria.

@obe
You could also try:

opencl_scheduling_profile=default
opencl_device_priority=+0/+0/+0/+0/+0

Use the ID of the card you want in place of 0. This will also direct all processing to the specified card. The 'priority' setting is only used if the scheduling profile is set to default.
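If you are not sure which ID your card has, the darktable-cltest binary that ships with darktable prints the detected OpenCL devices along with their numbers (the exact output format varies between versions, so I won't quote it here):

darktable-cltest

The device number it reports for the NVIDIA card is what goes into the priority string.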

If your GPU is processing a module for a long time, and darktable has another pipeline to run, it will first wait (if the priority says using the GPU is mandatory), and then give up and fall back to the CPU anyway. For me, this usually meant lost time, because had darktable waited longer, the GPU would still have finished the task faster than the CPU fallback did. To extend the timeout, you can set a large value here:

opencl_mandatory_timeout=20000

Details here: How cheap a GPU will still speed up darktable? - #62 by kofa.

If in doubt, just follow the guidance from Hanno. He knows what he's talking about. For NVidia, in general: no tuning, and don't try to use unlimited memory.


Thank you for your response.
I think you are right, but I downloaded the newest driver a week ago. At that time the newest version was 546.17. That didn't change performance, so I doubt that upgrading to version 546.29 will change much…


I have disabled the onboard GPU in the BIOS. I was wondering if there would be any benefit, with regard to darktable only, in using this setting in Win 10 > Settings > Graphics settings:

[screenshot: the hardware acceleration toggle in the Windows graphics settings]

Start darktable with -d perf and export an image with your typical processing. Record the export time. Turn that setting off, restart, and repeat. Compare the export times for the same image.
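On Windows that would look something like this (path assumed to be the default install location; adjust as needed, and -d opencl can be added for OpenCL-specific messages):

"C:\Program Files\darktable\bin\darktable.exe" -d perf -d opencl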

@g-man, I just ran the test as you described. The results are within 0.001 sec across the board, whether hardware acceleration is turned on or off. 50% of the results were identical. I can see no reason to turn this feature on.


According to the darktable 4.6 user manual, "very fast GPU is the preferred setting for systems with a GPU that strongly outperforms the CPU".

How does one go about determining this? For example, my system uses an i7-8700 and an RTX 2060.

Same steps with -d perf. Turn off OpenCL, change some settings in modules to force some reprocessing, and then export. Set the profile to default and do the same; set it to very fast GPU and do the same.

Then go to the log and compare the [preview], [full] and [export] timings.
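On Windows the -d output ends up in a log file rather than the console; assuming the usual location under %LOCALAPPDATA% (check your install if the path differs), something like this pulls out the export timings:

findstr /C:"[export]" "%LOCALAPPDATA%\darktable\darktable-log.txt"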

If you browse here and select DT, you can assign high performance to it… It may not have any impact, but if the OS does throttle things down, especially on laptops, this prevents that and gives you the maximum performance possible, provided everything in the NVIDIA config and your software is set optimally… At least this is how I understand it to work… Combined with the power profile set to Ultimate, this ensures the OS doesn't put the brakes on… If that never actually happens with DT, then I guess it won't make much difference… but setting it to the max means you know that it doesn't…

That card should be fast enough that it will always be the better option, so using the "very fast GPU" setting to make it the preferred choice in all cases should be the most performant…

I have a 3060 Ti… I did a lot of optimization runs, but that was about a year ago and the code might have changed a bit… One thing I did try was the micro_nap setting… by default it is 250; I tried it at 0. It didn't introduce any crashes or funny business in my case, and I got the biggest bump of all the OpenCL settings from making that change… You could try that for a boost. It's easy to reverse if you suspect any negative effects…

@priort The only opencl entries I see are:

opencl=TRUE
opencl_checksum=3975488120
opencl_device_priority=*/!0,*/*/*/!0,* (I'm not sure if this is the right setting. I never changed it)
opencl_library=
opencl_mandatory_timeout=400
opencl_scheduling_profile=very fast GPU
opencl_tune_headroom=FALSE

No, there are options you can set in your darktablerc file:

darktable 4.6 user manual - memory & performance tuning

For example, this is mine…

cldevice_v5_nvidiacudanvidiageforcertx3060ti=0 0 0 16 16 1024 1 0 0.000 0.000 0.250
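For what it's worth, my reading of the 4.6 manual is that the fields in that string are, in order: avoid_atomics, micro_nap, pinned_memory, roundup width, roundup height, event handles, async pixelpipe, disable device, benchmark, advantage and unified fraction — so the second number is the micro_nap value mentioned above. Treat this as a sketch and check the manual for your version before editing:

cldevice_v5_nvidiacudanvidiageforcertx3060ti=0 250 0 16 16 1024 1 0 0.000 0.000 0.250
cldevice_v5_nvidiacudanvidiageforcertx3060ti=0 0 0 16 16 1024 1 0 0.000 0.000 0.250

The first line would be the default micro_nap of 250; the second is what I'm running, with micro_nap set to 0.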

@obe and as some background info on why setting memory to unlimited or maximum with no headroom is "begging for trouble":

  1. The docs are not correct at the moment; I think we will have it all correct in about a week or so.
  2. Let's assume you have a 4 GB card and let darktable use all of its memory. That will work in most cases when working in the darkroom, due to the low memory requirements there. Everyone is happy and thinks "wow, cool, got it". Now export that image at high quality. The memory requirements "explode", because we render at full resolution, not downscaled as in the darkroom. Now your device has to handle that data. If the requirements are too high it has to tile, sizing the tiles by the amount of CL memory reserved for darktable; either way, in your case it uses all of the card's memory. But your OS or Firefox are using graphics memory too. Bang: darktable allocates graphics memory it won't get, and the code won't work. So it has to abort the OpenCL code and fall back to the CPU code. Here A) the aborting takes time and B) you use the slow CPU! (Some rough numbers follow below.)
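To put rough numbers on "explode", assuming darktable's usual 4-channel 32-bit float pixel format (16 bytes per pixel):

darkroom preview at ~2 MP:  2,000,000 px x 16 B  ≈  32 MB per buffer
export of a 24 MP raw:     24,000,000 px x 16 B  ≈ 384 MB per buffer

A module needs at least an input and an output buffer plus its own temporaries, so a single full-resolution module can easily claim a gigabyte or more on its own — and on a 4 GB card shared with the OS and a browser, that is exactly where the allocation fails.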

So the lesson would be: never try to use more memory than is safe. Some years ago we were safe with a safety margin of 400MB; this is not true any more on Windows or Linux, as both the OS and applications now use more graphics memory.


That param has sections for each of darktable's pipelines, separated by /. The pipelines, in order, are:

  • image (the central view of the darkroom),
  • preview (I assume on the lighttable, and maybe also the navigation view in the darkroom?),
  • export,
  • thumbnail (on the lighttable, and also on the filmstrip),
  • preview2 (a second preview that can be shown on a 2nd display, for multi-monitor setups).

!0 means 'any device but the one with ID = 0'.
* means 'any device at all'.
+0 would mean 'mandatorily on a GPU'; then you list the device IDs, in the example case only ID = 0. If all listed devices are busy processing something else, darktable waits (blocks) until one becomes free, or until the opencl_mandatory_timeout elapses, then falls back to the CPU. The timeout is measured in units of 5 ms (don't ask me why), so 200 would mean 1 second.
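Putting that together, a made-up example for a machine where the NVIDIA card is device 0: forcing everything onto it except thumbnails could look like this (illustrative only):

opencl_device_priority=+0/+0/+0/*/+0

Here image, preview, export and preview2 must run on device 0 (falling back to the CPU only after the timeout), while thumbnails may run anywhere. On the timeout arithmetic: the default of 400 shown earlier works out to 400 × 5 ms = 2 s, while the suggested 20000 means 100 s.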