OpenCL, multiple GPUs, memory and tuning

I edit (just opening an image in the darkroom and applying the existing history stack) a 6036 x 4022 image (24 megapixels) with/without OpenCL and with/without changed scheduling and processing memory settings.

The history stack applies denoise (profiled), exposure, tone equalizer, color calibration, diffuse and sharpen, contrast equalizer, color balance rgb, filmic rgb and local contrast.

My pc is:
Windows 10, Darktable 4.4.2
PC: Intel® Core™ i7-4510U CPU @ 2.00 GHz (2.60 GHz) with 8.00 GB RAM
GPU 0: NVIDIA GeForce GTX 850M
GPU 1: Intel(R) HD Graphics 4400

You can find performance statistics here: darktable

Observations and questions:

GPU 1 is only used if GPU 0 is excluded from use via the opencl_device_priority setting (as it is in the default). In all other cases only GPU 0 is used, even if I select multiple GPUs. Why is that?

How can you determine from the statistics whether processes are running in parallel? I would expect that, with the default opencl_device_priority setting, GPU 0 and GPU 1 run in parallel, processing the center image and the preview respectively.

The documentation states that for editing a 20-megapixel image darktable requires at least 4 GB of physical RAM plus 4 to 8 GB of additional swap space, but that it will perform better the more memory you have. Apparently, though, darktable uses much less RAM, so setting the resources to “large” has no effect.
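(For a rough sense of scale, assuming darktable's usual 4 x 32-bit float per pixel, a single full-resolution buffer for this image is already

    6036 x 4022 pixels x 4 channels x 4 bytes ≈ 388 MB

and the pipeline, caches and modules keep several such buffers alive at once, which is roughly where the documentation's numbers come from.)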

The Nvidia card has enough memory, but your drivers are very old. Go to the Nvidia website and download the latest driver for that card.

The other card barely has enough memory. It will force tiling and make things slow.


If you’re using multiple GPUs with quite different performance, you’d better set up a user-specific prioritization: e.g. the dedicated GPU for the full and export pipes, the on-CPU (integrated) GPU for the preview.
The upcoming 4.6 will have some optimisations regarding OpenCL use, so maybe you can check the weekly builds provided by the ‘darktable windows insider program’ posts…
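As a sketch only (the exact field order and syntax are described in the darktable manual under opencl_device_priority, and device numbers depend on your system): assuming the NVIDIA card is device 0 and your build uses the image/preview/export/thumbnail/preview2 field order, a darktablerc entry pinning the darkroom and export pipes to the dedicated card while keeping it out of the previews could look roughly like

    opencl_device_priority=0/!0,*/0/*/!0,*

If I remember correctly, this setting is only honoured when the OpenCL scheduling profile is set to ‘default’.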

What do you think about even trying to use integrated graphics on a machine with only 8 GB of RAM? Maybe if there is no other GPU, but I would think that diverting RAM to support the integrated graphics might actually hurt overall system performance, no?

There is no need to speculate… Export an image with OpenCL disabled using -d perf… and then export using the integrated GPU… Compare the results.
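A minimal sketch of that comparison from a command prompt (file names are placeholders, and I’m assuming darktable-cli picks up your normal OpenCL configuration; if not, the same flags work when launching the main darktable binary):

    REM run the export on the CPU only
    darktable-cli DSC_5330.NEF cpu.jpg --core -d perf --disable-opencl
    REM run the export again with OpenCL enabled
    darktable-cli DSC_5330.NEF gpu.jpg --core -d perf

Then compare the reported pixel pipeline processing times.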

Well, looking at all the files the user posted, the fastest result was when only the NVIDIA card was used… Sure, what you suggest is a good check of whether it is even worth using the integrated graphics when you only have 8 GB. Reports on marginal systems have ranged from only moderately faster to actually slower depending on the modules used, but on a system with both, it seems better not to use the integrated GPU…

What would be the use of more memory when darktable doesn’t use the available memory and no tiling is taking place? See the selected statistics from “opencl_veryfastgpu”:

 0.0132 [memory] at startup
 0.0136 [memory] max address space (vmpeak):        45892 kB
        [memory] cur address space (vmsize):        43248 kB
        [memory] max used memory   (vmhwm ):        18768 kB
        [memory] cur used memory   (vmrss ):        18764 kB
 0.5132 [dt_get_sysresource_level] switched to 2 as `large'
 0.5133   total mem:       8122MB
 0.5134   mipmap cache:    1015MB
 0.5134   available mem:   5552MB
 0.5135   singlebuff:      126MB
 0.5136   OpenCL tune mem: WANTED
 0.5136   OpenCL pinned:   WANTED

DEVICE: 0: 'NVIDIA GeForce GTX 850M'
PLATFORM NAME & VENDOR: NVIDIA CUDA, NVIDIA Corporation
CANONICAL NAME: nvidiacudanvidiageforcegtx850m
DRIVER VERSION: 472.19
DEVICE VERSION: OpenCL 3.0 CUDA, SM_20 SUPPORT
DEVICE_TYPE: GPU
GLOBAL MEM SIZE: 4096 MB
MAX MEM ALLOC: 1024 MB

DEVICE: 2: 'Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz'
PLATFORM NAME & VENDOR: Intel(R) OpenCL, Intel(R) Corporation
CANONICAL NAME: intelropenclintelrcoretmi74510ucpu200ghz
DRIVER VERSION: 5.2.0.10094
DEVICE VERSION: OpenCL 1.2 (Build 10094)
DEVICE_TYPE: CPU
GLOBAL MEM SIZE: 8122 MB
MAX MEM ALLOC: 2031 MB

20.8709 [histogram] took 0.000 secs (0.000 CPU) scope draw
21.0493 [memory] before pixelpipe process
21.0493 [memory] max address space (vmpeak): 1869736 kB
        [memory] cur address space (vmsize): 1748112 kB
        [memory] max used memory   (vmhwm ):  877276 kB
        [memory] cur used memory   (vmrss ):  586588 kB

[opencl_summary_statistics] device 'NVIDIA CUDA NVIDIA GeForce GTX 850M' (0): peak memory usage 298135680 bytes (284.3 MB)

I’m not sure about all the nuances reported in the log, but I suspect that, in darktable and overall, you would have better system performance by completely disabling the integrated graphics and giving the OS and your software that extra RAM, without the overhead.

Your pipeline was actually marginally slower going from CPU-only to the default OpenCL setting that used both your dedicated card and the integrated one… When the integrated one is bypassed, the time drops dramatically from 8+ seconds to about 2 seconds…

Despite what the reported memory use says (and to be honest I didn’t look too closely), it would seem, and I am glad to be proven wrong, that the overhead of managing your integrated graphics alongside your dedicated card produced a result that was ever so slightly worse than your CPU alone and substantially worse than when only the dedicated GPU was used…

If you did some resource monitoring beyond just the simple pipeline runs in DT, you might be able to further gauge any potential benefits or pitfalls of disabling it…

Certain operations (e.g. mask feathering) require a lot of RAM. Recently I managed to crash darktable with a 10-megapixel image when I set darktable to use “unlimited” GPU memory on a card with 6 GB of RAM.

That’s why you should use unlimited :)

That does not show it didn’t use tiling.

In general, I fully agree with the statements from @kofa and @priort: if you have a potent-enough dedicated graphics card like the Nvidia, your system will probably be faster if you disable the Intel OpenCL device and go with “very fast GPU” or the default setting. This way you allow dt to use system memory for more important things than a pretty slow driver/card.


No tiling is taking place according to the -d tiling output. See the statistics in “opencl_veryfastgpu_checktiling”. (I switched off this option afterwards since no tiling was reported.)

Darktable has apparently been tuned a lot. Some releases ago I often experienced tiling.

Regardless of my settings, darktable only uses GPU 1 if GPU 0 is excluded in opencl_device_priority. Very clever!

I don’t understand why the peak memory usage of the NVIDIA is only 284 MB when its memory size is 4 GB.

I think that performance on my PC is OK, except when using diffuse & sharpen…

Did you update the driver like I mentioned?

If there is no tiling, then more memory is not going to help. More processing units in the GPU could help, but that means a physical change to your system.

Try updating the drivers, try the release candidate of dt 4.6, then try posting a -d perf log.
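For reference, the debug flags can be combined in a single run, e.g. (a sketch; add -d memory or -d tiling as well if RAM use and tiling are in question):

    darktable -d opencl -d perf

On Windows the output may end up in darktable’s log file rather than the console.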

That’s what I suspected. So Darktable is using less RAM than stated in the documentation. Great!

Device Manager tells me that I have the latest NVIDIA driver, dated 23.09.2021.
I can find a driver on NVIDIA.com dated 14.11.2023, but my HP laptop is not mentioned under “supported products”, and I don’t know how to revert to my present driver if the new driver, OpenCL and my PC turn out to be incompatible.

The memory used while developing in darkroom is much lower than while exporting.

The amount of required CL memory depends a lot on the settings in the specific module. In diffuse & sharpen we have the iterations taking time and the overlap forcing the system to tile.

There are no significant changes in the dt OpenCL code that reduce memory consumption; in fact some modules now take even more in 4.6, as there were bugs…

Maybe for the given image and stack no more RAM was needed?
If there was no tiling, then the reported max. usage was probably all that was needed.
Could you share the image and the sidecar? I also have an Nvidia card, so I can run a test for you on the latest master branch with the latest Nvidia drivers (under Linux, but the results should be quite similar).

It is supported by Windows. It will be fine. If you need to, you can uninstall or roll back.


I have uploaded the image and the XMP. You can access them by following the link in my original post.

Edit: I’ve edited this response extensively, as I first used a build of darktable from master, compiled 2-3 days ago, and the CPU path was very slow (the export took 5 minutes). Then I built 4.4.2, and the CPU path was much faster, about half a minute; then I re-built master (after updating from git), and it was about the same speed as 4.4.2. The ‘master’ numbers below are from this latest build.
(end of note)

Imported the image, opened it in the darkroom, exported as full-resolution JPG. Darktable resource level set to ‘large’.

    19.3503 [default_process_tiling_cl_ptp] [export] **** tiling module 'diffuse' for image with size 6036x4020 --> 6036x4020
    19.3503 [default_process_tiling_cl_ptp] [export] (3x1) tiles with max dimensions 5000x4020, pinned=OFF, good 2952x1972 and overlap 1024
    19.3503 [default_process_tiling_cl_ptp] [export] tile (0,0) size 5000x4020 at origin [0,0]
    ...
    27.1754 [dev_process_export] pixel pipeline processing took 12.428 secs (15.561 CPU)
    ...
 [opencl_summary_statistics] device 'NVIDIA CUDA NVIDIA GeForce GTX 1060 6GB' (0): peak memory usage 5230003200 bytes (4987.7 MB)
 [opencl_summary_statistics] device 'NVIDIA CUDA NVIDIA GeForce GTX 1060 6GB' (0): 1493 out of 1493 events were successful and 0 events lost. max event=984

Without OpenCL, the export took 28 seconds:

    46.6293 [dev_process_export] pixel pipeline processing took 28.165 secs (282.466 CPU)

With the same flags and 4.4.2 (with a clean, default config created by darktable 4.4.2 right now):

    48.4499 [default_process_tiling_cl_ptp] [export] tile (1,0) size 1316x4020 at origin [4720,0]
[...]
    48.9268 [dev_process_export] pixel pipeline processing took 13.459 secs (17.917 CPU)
 [opencl_summary_statistics] device 'NVIDIA CUDA NVIDIA GeForce GTX 1060 6GB' (0): peak memory usage 4063994980 bytes (3875.7 MB)
 [opencl_summary_statistics] device 'NVIDIA CUDA NVIDIA GeForce GTX 1060 6GB' (0): 2002 out of 2002 events were successful and 0 events lost. max event=1440

Fedora 39 KDE, Nvidia 3060 12 GB
Current master, 4.5.0~git1277.c426ece8

Full-resolution JPG export

    18.8576 [export] creating pixelpipe took 0.047 secs (0.066 CPU)
    18.8579 [dev_pixelpipe] took 0.000 secs (0.000 CPU) initing base buffer [export]
    18.8645 [dev_pixelpipe] took 0.007 secs (0.004 CPU) [export] processed `rawprepare' on GPU, blended on GPU
    18.8668 [dev_pixelpipe] took 0.002 secs (0.002 CPU) [export] processed `temperature' on GPU, blended on GPU
    18.8699 [dev_pixelpipe] took 0.003 secs (0.001 CPU) [export] processed `highlights' on GPU, blended on GPU
    19.2351 [histogram] took 0.003 secs (0.041 CPU) scope draw
    19.3403 [resample_plain] took 0.053 secs (0.224 CPU) 1:1 copy/crop of 6036x4020 pixels
    19.3699 [dev_pixelpipe] took 0.500 secs (5.247 CPU) [export] processed `demosaic' on CPU, blended on CPU
    19.5809 [dev_pixelpipe] took 0.211 secs (0.193 CPU) [export] processed `denoiseprofile' on GPU, blended on GPU
    19.8466 [dev_pixelpipe] took 0.266 secs (1.043 CPU) [export] processed `lens' on GPU, blended on GPU
    19.8531 [dev_pixelpipe] took 0.006 secs (0.003 CPU) [export] processed `exposure' on GPU, blended on GPU
    20.1591 [dev_pixelpipe] took 0.306 secs (1.920 CPU) [export] processed `toneequal' on CPU, blended on CPU
    20.1990 [dev_pixelpipe] took 0.040 secs (0.038 CPU) [export] processed `colorin' on GPU, blended on GPU
    20.2045 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_LAB-->IOP_CS_RGB took 0.005 secs (0.002 GPU) [channelmixerrgb]
    20.2177 [dev_pixelpipe] took 0.019 secs (0.008 CPU) [export] processed `channelmixerrgb' on GPU, blended on GPU
    23.3041 [dev_pixelpipe] took 3.086 secs (2.763 CPU) [export] processed `diffuse' on GPU, blended on GPU
    23.3104 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_RGB-->IOP_CS_LAB took 0.006 secs (0.002 GPU) [atrous]
    23.5188 [dev_pixelpipe] took 0.215 secs (0.168 CPU) [export] processed `atrous' on GPU, blended on GPU
    23.5251 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_LAB-->IOP_CS_RGB took 0.006 secs (0.002 GPU) [colorbalancergb]
    23.5431 [dev_pixelpipe] took 0.024 secs (0.009 CPU) [export] processed `colorbalancergb' on GPU, blended on GPU
    23.5633 [dev_pixelpipe] took 0.020 secs (0.006 CPU) [export] processed `filmicrgb' on GPU, blended on GPU
    23.5687 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_RGB-->IOP_CS_LAB took 0.005 secs (0.003 GPU) [bilat]
    23.6755 [dev_pixelpipe] took 0.112 secs (0.021 CPU) [export] processed `bilat' on GPU, blended on GPU
    23.6873 [dev_pixelpipe] took 0.012 secs (0.008 CPU) [export] processed `colorout' on GPU, blended on GPU
    23.6894 [resample_cl] took 0.002 secs (0.001 CPU) 1:1 copy/crop of 6036x4020 pixels
    23.6948 [dev_pixelpipe] took 0.008 secs (0.005 CPU) [export] processed `finalscale' on GPU, blended on GPU
    23.7336 [dev_process_export] pixel pipeline processing took 4.876 secs (11.481 CPU)
    24.1868 [export_job] exported to `//home/gman/Pictures/NextCloud Photos/Play Raw Processed/DSC_5330.jpg'

That D&S is set to 10 iterations and takes 3 sec. I think 10 is too high for that D&S (dehaze).


@g-man: my timings were off; I’ve updated my post.
The time difference (12 seconds for me, <5 seconds for you) shows the power + memory differences between our cards, I think. I guess you don’t get tiling reported with -d tiling.