OpenCL is mandatory, yet CPU is used (solved)

For testing purposes, knowing that diffuse or sharpen is computationally heavy, I set some rather extreme values to make sure it would take a while. NVIDIA GTX 1060 + Ryzen 5 5600X.

I wanted to force darktable to use the GPU whenever possible. Lines from the log at start-up (some lines omitted for brevity):

0.058074 [opencl_init] opencl_scheduling_profile: 'default'
0.058084 [opencl_init] opencl_device_priority: '+0/+0/+0/+0/+0'
0.086857 [opencl_init] found 1 device
[opencl_init] device 0: NVIDIA GeForce GTX 1060 6GB 
... lots of omitted lines...
0.193971 [opencl_priorities] these are your device priorities:
0.193972 [opencl_priorities]            image   preview export  thumbs  preview2
0.193975 [opencl_priorities]            0       0       0       0       0
0.193977 [opencl_priorities] show if opencl use is mandatory for a given pixelpipe:
0.193978 [opencl_priorities]            image   preview export  thumbs  preview2
0.193980 [opencl_priorities]            1       1       1       1       1
0.193986 [opencl_synchronization_timeout] synchronization timeout set to 200

OpenCL config from darktablerc:

opencl=TRUE
opencl_async_pixelpipe=true
opencl_avoid_atomics=false
opencl_checksum=3316763402
opencl_device_priority=+0/+0/+0/+0/+0
opencl_disable_drivers_blacklist=false
opencl_library=
opencl_mandatory_timeout=200
opencl_memory_headroom=800
opencl_memory_requirement=768
opencl_micro_nap=0
opencl_number_event_handles=1000
opencl_scheduling_profile=default
opencl_size_roundup=16
opencl_synch_cache=active module
opencl_use_cpu_devices=false
opencl_use_pinned_memory=false

Despite these settings and mandatory GPU processing for everything, the CPU is used for the full pipeline; the GPU is only used for the preview:

248.904512 [dev] took 0.000 secs (0.000 CPU) to load the image.
248.926537 [pixelpipe_process] [preview] using device 0
249.961895 [pixelpipe_process] [full] using device -1
250.138825 [histogram] took 0.003 secs (0.047 CPU) scope draw
257.417738 [dev_pixelpipe] took 8.491 secs (61.711 CPU) processed `diffuse or sharpen' on GPU, blended on GPU [preview]
257.419696 [dev_pixelpipe] took 0.002 secs (0.002 CPU) processed `color balance rgb' on GPU, blended on GPU [preview]
257.489168 [dev_pixelpipe] took 0.069 secs (0.587 CPU) processed `filmic rgb' on GPU, blended on GPU [preview]
image colorspace transform RGB-->Lab took 0.002 secs (0.027 CPU) [colorout ]
257.523707 [dev_pixelpipe] took 0.035 secs (0.364 CPU) processed `output color profile' on CPU, blended on CPU [preview]
257.526662 [dev_pixelpipe] took 0.003 secs (0.026 CPU) processed `display encoding' on CPU, blended on CPU [preview]
image colorspace transform RGB-->RGB took 0.050 secs (0.542 lcms2) [final histogram]
257.592694 [histogram] took 0.066 secs (0.669 CPU) final rgb parade
257.592825 [opencl_profiling] profiling device 0 ('NVIDIA GeForce GTX 1060 6GB'):
257.592831 [opencl_profiling] spent  0.0027 seconds in [Write Image (from host to device)]
257.592836 [opencl_profiling] spent  3.4426 seconds in diffuse_blur_bspline
257.592839 [opencl_profiling] spent  4.9918 seconds in diffuse_pde
257.592842 [opencl_profiling] spent  0.0006 seconds in blendop_mask_rgb_jzczhz
257.592846 [opencl_profiling] spent  0.0003 seconds in [Copy Image (on device)]
257.592848 [opencl_profiling] spent  0.0004 seconds in blendop_rgb_jzczhz
257.592850 [opencl_profiling] spent  0.0092 seconds in [Read Image (from device to host)]
257.592852 [opencl_profiling] spent  0.0004 seconds in colorbalancergb
257.592854 [opencl_profiling] spent  0.0002 seconds in filmic_mask_clipped_pixels
257.592856 [opencl_profiling] spent  0.0003 seconds in filmic_inpaint_noise
257.592858 [opencl_profiling] spent  0.0009 seconds in init_reconstruct
257.592860 [opencl_profiling] spent  0.0114 seconds in blur_2D_Bspline_horizontal
257.592861 [opencl_profiling] spent  0.0165 seconds in blur_2D_Bspline_vertical
257.592863 [opencl_profiling] spent  0.0112 seconds in wavelets_detail_level
257.592865 [opencl_profiling] spent  0.0146 seconds in wavelets_reconstruct
257.592867 [opencl_profiling] spent  0.0006 seconds in compute_ratios
257.592869 [opencl_profiling] spent  0.0008 seconds in restore_ratios
257.592871 [opencl_profiling] spent  0.0003 seconds in filmicrgb_chroma
257.592873 [opencl_profiling] spent  8.5049 seconds totally in command queue (with 0 events missing)
257.593761 [dev_process_preview] pixel pipeline processing took 8.689 secs (63.394 CPU)
257.613740 [histogram] took 0.003 secs (0.043 CPU) scope draw
280.329631 [dev_pixelpipe] took 30.368 secs (311.416 CPU) processed `diffuse or sharpen' on CPU, blended on CPU [full]
280.376793 [dev_pixelpipe] took 0.047 secs (0.530 CPU) processed `color balance rgb' on CPU, blended on CPU [full]
280.555335 [dev_pixelpipe] took 0.179 secs (1.640 CPU) processed `filmic rgb' on CPU, blended on CPU [full]
image colorspace transform RGB-->Lab took 0.001 secs (0.009 CPU) [colorout ]
280.580039 [dev_pixelpipe] took 0.025 secs (0.202 CPU) processed `output color profile' on CPU, blended on CPU [full]
280.582331 [dev_pixelpipe] took 0.002 secs (0.017 CPU) processed `display encoding' on CPU, blended on CPU [full]
280.582731 [dev_process_image] pixel pipeline processing took 31.633 secs (314.555 CPU)

Do I misunderstand the configuration?
The documentation says (darktable 3.8 user manual - multiple devices):

If a pixelpipe process is about to be started and all GPUs in the corresponding group are busy, darktable automatically processes the image on the CPU by default. You can enforce GPU processing by prefixing the list of allowed GPUs with a plus sign +. In this case darktable will not use the CPU but rather suspend processing until the next permitted OpenCL device is available.
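To make the syntax concrete: my opencl_device_priority=+0/+0/+0/+0/+0 has five fields, one per pixelpipe, matching the columns of the [opencl_priorities] startup log above:

+0    / +0      / +0     / +0     / +0
image   preview   export   thumbs   preview2

The + prefix marks OpenCL as mandatory for that pipe, which is why the log prints 1 in every column of the "mandatory" table.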

Therefore, I’d expect all processing to be done on the GPU.

darktable 3.9.0+69~gb72c7d7ba, Release build, Linux.

Does it change at all if you set the scheduling profile to very fast GPU?

It’s rather weird. With very fast GPU, when I open the image in darkroom mode (I deleted ~/.cache/darktable before starting the program), I see:

99.358304 [dev] took 0.000 secs (0.000 CPU) to load the image.
99.821298 [histogram] took 0.000 secs (0.000 CPU) scope draw
99.832309 [dev] took 0.000 secs (0.000 CPU) to load the image.
99.853368 [histogram] took 0.000 secs (0.000 CPU) scope draw
99.883821 [pixelpipe_process] [full] using device 0
...
107.727294 [opencl_profiling] profiling device 0 ('NVIDIA GeForce GTX 1060 6GB'):
...
107.727356 [opencl_profiling] spent  7.6378 seconds totally in command queue (with 0 events missing)
107.728130 [dev_process_image] pixel pipeline processing took 7.844 secs (6.122 CPU)
107.750969 [histogram] took 0.000 secs (0.000 CPU) scope draw
107.787121 [pixelpipe_process] [preview] using device 0
...
116.489242 [opencl_profiling] profiling device 0 ('NVIDIA GeForce GTX 1060 6GB'):
...
116.489305 [opencl_profiling] spent  8.4819 seconds totally in command queue (with 0 events missing)
116.490394 [dev_process_preview] pixel pipeline processing took 8.736 secs (7.363 CPU)

But then, when I modify the iteration count of diffuse or sharpen to trigger reprocessing, I get the full pipe on the GPU and the preview on the CPU:

152.278399 [pixelpipe_process] [full] using device 0
153.315040 [pixelpipe_process] [preview] using device -1
...
160.099152 [opencl_profiling] profiling device 0 ('NVIDIA GeForce GTX 1060 6GB'):
...
160.099316 [opencl_profiling] spent  7.7149 seconds totally in command queue (with 0 events missing)
160.102171 [dev_process_image] pixel pipeline processing took 7.824 secs (56.593 CPU)
188.951411 [dev_pixelpipe] took 35.636 secs (371.242 CPU) processed `diffuse or sharpen' on CPU, blended on CPU [preview]
189.025947 [dev_pixelpipe] took 0.075 secs (0.597 CPU) processed `color balance rgb' on CPU, blended on CPU [preview]
189.254420 [dev_pixelpipe] took 0.228 secs (2.274 CPU) processed `filmic rgb' on CPU, blended on CPU [preview]
image colorspace transform RGB-->Lab took 0.001 secs (0.011 CPU) [colorout ]
189.275863 [dev_pixelpipe] took 0.021 secs (0.221 CPU) processed `output color profile' on CPU, blended on CPU [preview]
189.278469 [dev_pixelpipe] took 0.003 secs (0.018 CPU) processed `display encoding' on CPU, blended on CPU [preview]
image colorspace transform RGB-->RGB took 0.028 secs (0.314 lcms2) [final histogram]
189.311976 [histogram] took 0.033 secs (0.343 CPU) final rgb parade
189.312462 [dev_process_preview] pixel pipeline processing took 37.057 secs (375.558 CPU)

OK, I think it simply gives up after opencl_mandatory_timeout.

The sleep time is 5 ms (usec = 5000) and my opencl_mandatory_timeout is 200, so it gives up after 200 × 5 ms = 1 second, if I understand correctly.
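To spell the mechanism out, here is a minimal C sketch of the retry loop as I understand it; the function name, signature and the device_is_free callback are mine for illustration, not darktable’s actual code (the real logic lives in src/common/opencl.c):

#include <stdbool.h>
#include <unistd.h>

/* Illustrative only; mandatory_timeout is the opencl_mandatory_timeout
   value from darktablerc, i.e. the number of attempts. */
static int acquire_mandatory_device(int device, int mandatory_timeout,
                                    bool (*device_is_free)(int))
{
  for(int tries = 0; tries < mandatory_timeout; tries++)
  {
    if(device_is_free(device))
      return device; /* the GPU became available */
    usleep(5000);    /* usec = 5000: sleep 5 ms, then try again */
  }
  return -1; /* give up after mandatory_timeout * 5 ms and
                fall back to the CPU (device -1) */
}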

I could have seen it here:

152.278399 [pixelpipe_process] [full] using device 0
153.315040 [pixelpipe_process] [preview] using device -1

Note the 1-second gap between the two.

Raising the loop count (opencl_mandatory_timeout=20000) so it allows up to a 100 s wait results in GPU-only processing:

47.088374 [pixelpipe_process] [full] using device 0
...
54.892174 [opencl_profiling] profiling device 0 ('NVIDIA GeForce GTX 1060 6GB'):
54.892212 [opencl_profiling] spent  7.7388 seconds totally in command queue (with 0 events missing)
54.892535 [dev_process_image] pixel pipeline processing took 7.804 secs (5.978 CPU)
54.893041 [pixelpipe_process] [preview] using device 0
...
63.725147 [opencl_profiling] spent  8.7267 seconds totally in command queue (with 0 events missing)
63.725630 [dev_process_preview] pixel pipeline processing took 16.661 secs (13.020 CPU)
63.750731 [histogram] took 0.004 secs (0.004 CPU) scope draw

but it says “using device 0” for both pipes – so device 0 is the GPU here? Though I thought that was always the CPU.

Nope, the CPU is -1. See:

0.086857 [opencl_init] found 1 device
[opencl_init] device 0: NVIDIA GeForce GTX 1060 6GB 

Yeah my bad, you’re right. Should have thought before posting!

I should have read what I had copy-pasted here. :smiley:


As usual I’m a little slow… :) @kofa, just to clarify: is the solution from your test settings that you are

1 going back to very fast GPU?
2 bumping the timeout higher – have you left it at 100 s?
3 back to opencl_device_priority=+*/+*/+*/+*/+* from your test of opencl_device_priority=+0/+0/+0/+0/+0?
4 likely your real solution and I missed it… :slight_smile:

Yes, you force darktable (only possible in default mode) to use just the GPU – if your GPU is blocked by the preview pipe, then the full pipe is processed in parallel on the CPU, because there are no free GPU resources.

If you want to process the preview and full pipes in sequence, then you need the very fast GPU setting (but this ignores your priorities).
Maybe you can suppress CPU processing via !-1 – but I doubt this is a good idea :wink:

@priort, @MStraeten

opencl_scheduling_profile can be set to either very fast GPU or default (there is also multiple GPUs, which is not relevant here). If it’s set to very fast GPU, the device priorities are forced to +*/+*/+*/+*/+*.

If set to default, it honours whatever is specified in opencl_device_priority.

So one piece of the puzzle is to use either opencl_scheduling_profile=very fast GPU or opencl_scheduling_profile=default. The latter also requires a setting for opencl_device_priority that forces processing on the GPU. I used opencl_device_priority=+0/+0/+0/+0/+0 (use the 1st device for everything – I have a single GPU), but the one forced by very fast GPU, +*/+*/+*/+*/+* (force the use of any GPU device), is also a good option (especially if one has multiple high-performance GPUs).
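For the default-profile route, the relevant darktablerc lines would be just these (a sketch of my single-GPU setup; the timeout is discussed next):

opencl_scheduling_profile=default
opencl_device_priority=+0/+0/+0/+0/+0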

The third piece is opencl_mandatory_timeout. No matter which profile is used, the timeout logic applies: if the GPU is mandatory for one of the pipes (+ sign in the priority), darktable will wait for a GPU to become available, but that wait is limited by opencl_mandatory_timeout (which sets the number of attempts, each attempt meaning a 5 ms wait). Setting very fast GPU also sets pixelpipe_synchronization_timeout, but it does not change anything in the wait-then-fall-back-to-CPU logic.
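A quick sanity check of the arithmetic:

opencl_mandatory_timeout=200 → 200 × 5 ms = 1 s (the default, which is what I ran into above)
opencl_mandatory_timeout=20000 → 20000 × 5 ms = 100 s (what I use below)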

So the simplest settings for my case are:

opencl_scheduling_profile=very fast GPU
opencl_mandatory_timeout=20000

With those, darktable will wait up to 100 s; opencl_device_priority specified in the file is ignored and overridden to +*/+*/+*/+*/+* (mandatory processing on any available GPU).

I have a Ryzen 5700G with an NVIDIA 3060, but even if I turn off OpenCL (CPU only), I don’t see the processing times you are seeing. I see the diffuse module running for 30.368 secs on your CPU. I think we have similar CPUs, so we should have similar results.

This is a 5344 × 3462 pixel image on which I used some of the heavy processing modules, with OpenCL off:

7.570198 [dev_pixelpipe] took 0.023 secs (0.219 CPU) processed `diffuse or sharpen' on CPU, blended on CPU [full]

The whole image took:

7.743952 [dev_process_image] pixel pipeline processing took 0.531 secs (6.219 CPU)

With OpenCL on, it is faster, but we do have different cards:

16.392442 [dev_pixelpipe] took 0.012 secs (0.016 CPU) processed `diffuse or sharpen' on GPU, blended on GPU [full]
16.438324 [dev_process_image] pixel pipeline processing took 0.309 secs (0.703 CPU)

Most of your other modules show durations similar to mine (and I’m on Windows vs Linux). Why would this one (diffuse) take so long?

#2 – I do think the term “fast GPU” could be confusing. The speed of the GPU is not as important as the number of parallel computations it can perform. A card with more processing cores will benefit darktable more when it is doing all the processing the modules ask for, even more so when multiple modules are asked to process in parallel. To reduce your overall image processing time, I would let the system use your CPU, at least for the preview processing.

Well, for this test, I used 300 iterations or more, but with a 100 Mpixel photo, I don’t need to do that. See https://github.com/darktable-org/darktable/issues/10828


Oh! 300 iterations! I can see the need to increase the default of 200 (1 s) to something higher for a 100 Mpixel photo. A quick check of my log shows the longest module run was about 100 ms (demosaic). What would be a practical number (e.g. not 300 iterations) for your 100 Mpixel images?

They are not my images, as you can see if you check the issue. But with such an image, the simulate line drawing preset triggered this behaviour.