OpenCL analysis... Darktable... much faster with Opencl disabled...something wrong??

kofa · December 10, 2021, 10:52pm

The complete set is:

opencl=TRUE
opencl_async_pixelpipe=true
opencl_avoid_atomics=false
opencl_checksum=3316763402
opencl_device_priority=+*/+*/+*/+*/+*
opencl_disable_drivers_blacklist=false
opencl_library=
opencl_mandatory_timeout=200
opencl_memory_headroom=800
opencl_memory_requirement=768
opencl_micro_nap=0
opencl_number_event_handles=1000
opencl_scheduling_profile=very fast GPU
opencl_size_roundup=16
opencl_synch_cache=active module
opencl_use_cpu_devices=false
opencl_use_pinned_memory=false
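
If you want to pull the same list out of your own config for comparison, here is a rough sketch. It assumes the settings are stored as plain key=value lines, as in the listing above. The function name `parse_opencl_settings` and the non-OpenCL key in the sample are made up for illustration; this is not part of darktable.

```python
def parse_opencl_settings(text):
    """Collect every key starting with 'opencl' from key=value lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        # Split on the first '=' only, so values that contain '=' stay intact.
        key, _, value = line.partition("=")
        if key.startswith("opencl"):
            settings[key] = value
    return settings


# Sample input: a few lines from the listing above plus one
# hypothetical unrelated key that should be filtered out.
sample = """\
opencl=TRUE
opencl_library=
opencl_device_priority=+*/+*/+*/+*/+*
opencl_scheduling_profile=very fast GPU
some_other_key=1
"""

settings = parse_opencl_settings(sample)
for key in sorted(settings):
    print(f"{key}={settings[key]}")
```

Empty values such as `opencl_library=` are kept as empty strings, so a setting that is present but blank can still be told apart from one that is missing.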

Descriptions:

opencl_async_pixelpipe: if set to TRUE OpenCL pixelpipe will not be synchronized on a per-module basis. this can improve pixelpipe latency. however, potential OpenCL errors would be detected late; in such a case the complete pixelpipe needs to be reprocessed instead of only a single module. export pixelpipe will always be run synchronously.
opencl_avoid_atomics: if set to TRUE darktable will not use OpenCL kernels which contain atomic operations (example bilateral). pixelpipe processing will be done on CPU for the affected modules. useful if your OpenCL implementation freezes/crashes on atomics or if they are processed with a bad performance.
opencl_checksum: darktable re-checks the performance benchmarks of your system in case your setup has changed, which is indicated by a change versus the stored checksum in this config variable; darktable de-activates opencl if the GPU benchmark lies below the one of the CPU; initial value is the empty string; set to OFF if you want to deactivate any automatic checks and prefer to do all configurations manually.
opencl_device_priority: (ignored because of opencl_scheduling_profile=very fast GPU) defines priorities on how (multiple) OpenCL devices are allocated to the different types of pixelpipe (full, preview, export, thumbnail, preview2). for more details visit our usermanual (needs a restart).
opencl_disable_drivers_blacklist: not all opencl implementations are fully-functional, there are at least two, which are not production-quality yet: pocl and beignet; therefore at this point in time, it is better to explicitly blacklist them by default. can be changed by setting this configuration option to TRUE (needs a restart).
opencl_library: OpenCL runtime library is normally detected automatically by darktable. if your OpenCL runtime is at an unusual place and cannot be detected, enter the full pathname here. leave empty for default behavior.
opencl_mandatory_timeout: time period (in units of 5ms) after which we give up try-locking an opencl device for mandatory use. defaults to 200.
opencl_memory_headroom: this amount of memory (in MB) will be subtracted from total GPU memory in order to calculate the available OpenCL memory. too low values will lead to out-of-memory situations in OpenCL processing. too high values will lead to unnecessary tiling (needs a restart).
opencl_memory_requirement: OpenCL will only be activated if your graphics card has at least this amount of memory. reducing the value will allow cards with less GPU memory to be used - but at the risk of lower system stability and occasional crashes. values below 200 will be treated as 200.
opencl_micro_nap: for slow GPUs this gives your graphics driver some time to breathe to do needed screen updates. can be left at zero for fast devices.
opencl_number_event_handles: a positive non-zero integer defines the number of event handles that darktable may have opened on a device. a value of -1 does not pose any restrictions, bearing the risk of hitting the device’s resource limits. a value of zero completely prevents the use of event handles.
opencl_scheduling_profile: defines how preview and full pixelpipe tasks are scheduled on OpenCL enabled systems. default - GPU processes full and CPU processes preview pipe (adaptable by config parameters); multiple GPUs - process both pixelpipes in parallel on two different GPUs; very fast GPU - process both pixelpipes sequentially on the GPU.
opencl_size_roundup: in OpenCL processing round width/height of global work groups to a multiple of this value. reasonable values are powers of 2. this parameter can have high impact on OpenCL performance.
opencl_synch_cache: active module (default) - cache the input to the currently focused module, which allows for faster response time when making multiple adjustments to that module (though the whole pipeline may need to be reprocessed when another module is changed); true - cache the output after each module, which may improve speed, as the whole pixelpipe won’t be reprocessed on every parameter change, though will require more memory transfers from the GPU; false - do not sync the pixelpipe cache from OpenCL, which avoids memory transfers from GPUs fast enough to smoothly reprocess the whole pixelpipe.
opencl_use_cpu_devices: typically darktable’s hand-optimized CPU code is much faster than any OpenCL-on-CPU code; therefore CPUs are excluded from being used as OpenCL devices by default. can be changed by setting this configuration option to TRUE (needs a restart).
opencl_use_pinned_memory: during tiling huge amounts of memory need to be transferred between host and device. for some OpenCL implementations direct memory transfers give a drastic performance penalty. this can often be avoided by using indirect transfers via pinned memory. other devices have more efficient direct memory transfer implementations. AMD seems to belong to the first group, nvidia to the second.
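
Several of these descriptions boil down to simple arithmetic, so here is a small sketch of how I read them. It only restates the descriptions above and is not taken from darktable's source. All function names are made up, and the 4096 MB card and the 1001-pixel width are assumed example values.

```python
def timeout_seconds(units):
    """opencl_mandatory_timeout is given in units of 5 ms."""
    return units * 5 / 1000


def available_memory_mb(total_mb, headroom_mb):
    """opencl_memory_headroom is subtracted from total GPU memory."""
    return total_mb - headroom_mb


def effective_requirement_mb(requirement_mb):
    """Values below 200 are treated as 200."""
    return max(requirement_mb, 200)


def device_has_enough_memory(total_mb, requirement_mb):
    """OpenCL is only activated if the card has at least the required memory."""
    return total_mb >= effective_requirement_mb(requirement_mb)


def round_up(size, multiple):
    """Round a work-group width/height up to a multiple of opencl_size_roundup."""
    return ((size + multiple - 1) // multiple) * multiple


total_mb = 4096  # assumed card size, purely for illustration

print(timeout_seconds(200))                     # 1.0 (seconds)
print(available_memory_mb(total_mb, 800))       # 3296
print(effective_requirement_mb(100))            # 200
print(device_has_enough_memory(total_mb, 768))  # True
print(round_up(1001, 16))                       # 1008
```

With the values from my config, the mandatory timeout works out to one second, and on the assumed 4096 MB card the headroom of 800 leaves 3296 MB for OpenCL before tiling kicks in.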

Non-default values in my config:

opencl_async_pixelpipe=true (instead of false) - in order to improve latency; I see no issues mentioned in the description
opencl_memory_headroom=800 (instead of 400) - for me, OpenCL was not working reliably with less
opencl_micro_nap=0 (instead of 1000) - since the docs say ‘can be left at zero for fast devices’
opencl_number_event_handles=1000 (instead of 25) - more is probably better, I guess; I have seen no issues that would indicate ‘hitting the device’s resource limits’
opencl_scheduling_profile=very fast GPU (instead of default) - if this value is not present in the config, darktable tries to find the correct value by doing a very simple measurement at start-up; the measurement is not very reliable, though: