I edit (just opening an image in the darkroom and applying the existing history stack) a 6036 × 4022 image (24 megapixels), with/without OpenCL and with/without changed scheduling and processing-memory settings.
The history stack applies denoise (profiled), exposure, tone equalizer, color calibration, diffuse or sharpen, contrast equalizer, color balance rgb, filmic rgb and local contrast.
My PC:
Windows 10, darktable 4.4.2
CPU: Intel Core i7-4510U @ 2.00 GHz (up to 2.60 GHz), 8.00 GB RAM
GPU 0: NVIDIA GeForce GTX 850M
GPU 1: Intel HD Graphics 4400
You can find the performance statistics in the log files I posted.
Observations and questions:
GPU 1 is only used if GPU 0 is excluded from use, as in the default opencl_device_priority setting. In all other cases only GPU 0 is used, even if I select multiple GPUs. Why is that?
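For reference, here is a sketch of the opencl_device_priority syntax as I recall it from the darktablerc comments and the manual (check the documentation for your exact version; recent releases add a fifth group for the second darkroom window):

opencl_device_priority=a,b,c.../k,l,m.../o,p,q.../x,y,z...

The '/'-separated groups configure the center-image, preview, export and thumbnail pipes, in that order. Within a group, devices are listed by number or canonical name, '*' matches any device, and a leading '!' excludes a device. The historical default '*/!0,*/*/*' lets any device serve the center image, export and thumbnails, but keeps device 0 away from the preview pipe.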
How can you tell from the statistics whether processes are running in parallel? With the default opencl_device_priority setting, I would expect GPU 0 and GPU 1 to run in parallel, processing the center image and the preview.
The documentation states that for editing a 20-megapixel image darktable requires at least 4 GB of physical RAM plus 4 to 8 GB of additional swap space, but that it performs better the more memory you have. Apparently, though, darktable uses much less RAM, so setting the resource level to “large” has no effect.
If you’re using multiple GPUs with quite different performance, you’d better set a user-specific prioritization: e.g. the dedicated GPU for the full and export pipes, the on-CPU GPU for the preview.
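As a concrete sketch, assuming darktable enumerates the NVIDIA as device 0 and the Intel iGPU as device 1 (verify the numbers against your -d opencl startup log first), something like

opencl_device_priority=!1,*/!0,*/!1,*/!1,*

would keep the iGPU out of the center-image, export and thumbnail pipes while reserving the preview pipe for it.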
The upcoming 4.6 will have some optimisations regarding OpenCL use, so maybe you can check the weekly builds provided in the “darktable windows insider program” posts…
What do you think about even trying to use integrated graphics on a machine with only 8 GB of RAM? Maybe if there is no other GPU, but I would think that diverting RAM to support integrated graphics might actually hurt system performance, no?
Well, looking at all the files the user posted, the fastest result was when only the NVIDIA card was used. What you suggest is indeed a good check of whether integrated graphics is even worth using when you only have 8 GB. I think reports on marginal systems have ranged from only moderately faster to actually slower depending on the modules used, but on this system, with both GPUs enabled, it seems better not to use the integrated one…
What would be the use of more memory when darktable doesn’t use the available memory and no tiling is taking place? See selected statistics from “opencl_veryfastgpu”:
0.0132 [memory] at startup
0.0136 [memory] max address space (vmpeak): 45892 kB
[memory] cur address space (vmsize): 43248 kB
[memory] max used memory (vmhwm ): 18768 kB
[memory] cur used memory (vmrss ): 18764 kB
0.5132 [dt_get_sysresource_level] switched to 2 as `large'
0.5133 total mem: 8122MB
0.5134 mipmap cache: 1015MB
DEVICE: 0: 'NVIDIA GeForce GTX 850M'
PLATFORM NAME & VENDOR: NVIDIA CUDA, NVIDIA Corporation
CANONICAL NAME: nvidiacudanvidiageforcegtx850m
DRIVER VERSION: 472.19
DEVICE VERSION: OpenCL 3.0 CUDA, SM_20 SUPPORT
DEVICE_TYPE: GPU GLOBAL MEM SIZE: 4096 MB MAX MEM ALLOC: 1024 MB
DEVICE: 2: 'Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz'
PLATFORM NAME & VENDOR: Intel(R) OpenCL, Intel(R) Corporation
CANONICAL NAME: intelropenclintelrcoretmi74510ucpu200ghz
DRIVER VERSION: 5.2.0.10094
DEVICE VERSION: OpenCL 1.2 (Build 10094)
DEVICE_TYPE: CPU GLOBAL MEM SIZE: 8122 MB MAX MEM ALLOC: 2031 MB
20.8709 [histogram] took 0.000 secs (0.000 CPU) scope draw
21.0493 [memory] before pixelpipe process
21.0493 [memory] max address space (vmpeak): 1869736 kB
[memory] cur address space (vmsize): 1748112 kB
[memory] max used memory (vmhwm ): 877276 kB
[memory] cur used memory (vmrss ): 586588 kB
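For reference, logs like the above can be produced with darktable’s debug switches, e.g.:

darktable -d opencl -d memory -d perf -d tiling

(On Windows the output goes to darktable’s log file rather than to a console; see the manual for its location.)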
I’m not sure about all the nuances reported in the log, but I suspect that in darktable, and overall, you would get better system performance by completely disabling integrated graphics and giving the OS and your software that extra RAM, without the overhead.
Your pipeline was actually marginally slower going from CPU-only to the default OpenCL setting, which used both your dedicated card and the integrated one… When the integrated one is bypassed, the time drops dramatically, from 8+ seconds to about 2 seconds…
Despite what the reported memory use says (to be honest, I didn’t look too closely, and I’m glad to be proven wrong), it would seem that the overhead of managing your integrated graphics alongside your dedicated card produced a result ever so slightly worse than your CPU alone, and substantially worse than the dedicated GPU on its own…
If you did some resource monitoring beyond just the simple pipeline runs in darktable, you might be able to further gauge any potential benefits or pitfalls of disabling it…
Certain operations (e.g. mask feathering) require a lot of RAM. Recently I managed to crash darktable with a 10-megapixel image when I set darktable to use “unlimited” GPU memory on a card with 6 GB of RAM.
In general, I fully agree with the statements from @kofa and @priort: if you have a potent enough dedicated graphics card like the NVIDIA, your system will probably be faster if you disable the Intel OpenCL device and go with “very fast GPU” or the default setting. This way you allow darktable to use system memory for more important things than a pretty slow driver/card.
No tiling is taking place according to the -d tiling output; see the statistics in “opencl_veryfastgpu_checktiling”. (I switched this option off again since no tiling was reported.)
darktable has apparently been tuned a lot; some releases ago I often experienced tiling.
Darktable only uses GPU 1, regardless of my settings, if GPU 0 is excluded in opencl_device_priority. Very clever!
I don’t understand why the peak memory usage of the NVIDIA is only 284 MB when the card’s memory size is 4 GB.
I think that performance on my PC is OK, except when using diffuse or sharpen…
If there is no tiling, then more memory is not going to help. More processing units in the GPU could help, but that means a physical change to your system.
Try updating the drivers, try the release candidate of darktable 4.6, then try posting a -d perf log.
That’s what I suspected. So darktable is using less RAM than stated in the documentation. Great!
Device Manager tells me that I have the latest NVIDIA driver, dated 23.09.2021.
I can find a driver on NVIDIA.com dated 14.11.2023, but my HP is not mentioned under “supported products”, and I don’t know how to revert to my present driver if the new driver, OpenCL and my PC turn out to be incompatible.
The memory used while developing in the darkroom is much lower than while exporting.
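That is expected: the darkroom center view processes the image at display resolution, while export runs the pipeline at full size. A rough worked example, assuming darktable’s usual 4 × 32-bit floats per pixel (16 bytes) and a hypothetical fit-to-screen view of about 1800 × 1200:

export: 6036 × 4022 px × 16 bytes ≈ 388 MB per intermediate buffer
darkroom (fit): 1800 × 1200 px × 16 bytes ≈ 35 MB per intermediate buffer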
The amount of OpenCL memory required depends a lot on the settings in the specific module. In diffuse or sharpen, the iterations take time and the overlapping forces the system to tile.
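A worked example of why the overlap matters, taken from the tiling log further down (and again assuming 4 × 32-bit floats per pixel): a tile with a “good” region of 2952 × 1972 and an overlap of 1024 on each side grows to (2952 + 2 × 1024) × (1972 + 2 × 1024) = 5000 × 4020 px, i.e. roughly 322 MB per buffer instead of the ~93 MB the good region alone would need. Larger radii in diffuse or sharpen increase that overlap, so the pipeline is pushed into tiling sooner.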
There are no significant changes in darktable’s OpenCL code that reduce memory consumption; in fact, some modules take even more in 4.6, as there were bugs…
Maybe for the given image and stack no more RAM was needed?
If there was no tiling, then the reported max. usage was probably all that was needed.
Could you share the image and the sidecar? I also have an Nvidia card, so I can run a test for you on the latest master branch with the latest Nvidia drivers (under Linux, but the results should be quite similar).
Edit: I’ve edited this response extensively, as I first used a build of darktable from master, compiled 2-3 days ago, and the CPU path was very slow (the export took 5 minutes). Then I built 4.4.2, and the CPU path was much faster, about half a minute; then I re-built master (after updating from git), and it was about the same speed as 4.4.2. The ‘master’ numbers below are from this latest build. (end of note)
Imported the image, opened it in the darkroom, and exported as a full-resolution JPG, with the darktable resource level set to ‘large’.
19.3503 [default_process_tiling_cl_ptp] [export] **** tiling module 'diffuse' for image with size 6036x4020 --> 6036x4020
19.3503 [default_process_tiling_cl_ptp] [export] (3x1) tiles with max dimensions 5000x4020, pinned=OFF, good 2952x1972 and overlap 1024
19.3503 [default_process_tiling_cl_ptp] [export] tile (0,0) size 5000x4020 at origin [0,0]
...
27.1754 [dev_process_export] pixel pipeline processing took 12.428 secs (15.561 CPU)
...
[opencl_summary_statistics] device 'NVIDIA CUDA NVIDIA GeForce GTX 1060 6GB' (0): peak memory usage 5230003200 bytes (4987.7 MB)
[opencl_summary_statistics] device 'NVIDIA CUDA NVIDIA GeForce GTX 1060 6GB' (0): 1493 out of 1493 events were successful and 0 events lost. max event=984
Without OpenCL, the export took 28 seconds:
46.6293 [dev_process_export] pixel pipeline processing took 28.165 secs (282.466 CPU)
With the same flags and 4.4.2 (with a clean, default config created by darktable 4.4.2 right now):
48.4499 [default_process_tiling_cl_ptp] [export] tile (1,0) size 1316x4020 at origin [4720,0]
[...]
48.9268 [dev_process_export] pixel pipeline processing took 13.459 secs (17.917 CPU)
[opencl_summary_statistics] device 'NVIDIA CUDA NVIDIA GeForce GTX 1060 6GB' (0): peak memory usage 4063994980 bytes (3875.7 MB)
[opencl_summary_statistics] device 'NVIDIA CUDA NVIDIA GeForce GTX 1060 6GB' (0): 2002 out of 2002 events were successful and 0 events lost. max event=1440
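If anyone wants to reproduce such a timing run without clicking through the GUI, darktable-cli can apply a sidecar and export from the command line (a sketch; the file names are placeholders):

darktable-cli input.raw input.raw.xmp output.jpg --core -d opencl -d perf -d tiling

The --core switch passes the debug options on to the embedded darktable core.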
@g-man : my timings were off, I’ve updated my post.
The time difference (12 seconds for me, <5 seconds for you) shows the power and memory differences between our cards, I think. I guess you don’t get tiling reported with -d tiling.