OpenCL analysis... Darktable... much faster with Opencl disabled...something wrong??

Ya I had made that change after following that thread. I had just been using build.sh to build DT and I think it was set to create RelWithDebInfo version of DT…Did you tweak any of your other OPENCL settings other than the headroom…

The complete set is:

opencl=TRUE
opencl_async_pixelpipe=true
opencl_avoid_atomics=false
opencl_checksum=3316763402
opencl_device_priority=+*/+*/+*/+*/+*
opencl_disable_drivers_blacklist=false
opencl_library=
opencl_mandatory_timeout=200
opencl_memory_headroom=800
opencl_memory_requirement=768
opencl_micro_nap=0
opencl_number_event_handles=1000
opencl_scheduling_profile=very fast GPU
opencl_size_roundup=16
opencl_synch_cache=active module
opencl_use_cpu_devices=false
opencl_use_pinned_memory=false

Descriptions:

  • opencl_async_pixelpipe: if set to TRUE OpenCL pixelpipe will not be synchronized on a per-module basis. this can improve pixelpipe latency. however, potential OpenCL errors would be detected late; in such a case the complete pixelpipe needs to be reprocessed instead of only a single module. export pixelpipe will always be run synchronously.
  • opencl_avoid_atomics: if set to TRUE darktable will not use OpenCL kernels which contain atomic operations (example bilateral). pixelpipe processing will be done on CPU for the affected modules. useful if your OpenCL implementation freezes/crashes on atomics or if they are processed with a bad performance.
  • opencl_checksum: darktable re-checks the performance benchmarks of your system in case your setup has changed, which is indicated by a change versus the stored checksum in this config variable; darktable de-activates opencl if the GPU benchmark lies below the one of the CPU; initial value is the empty string; set to OFF if you want to deactivate any automatic checks and prefer to do all configurations manually.
  • opencl_device_priority: (ignored because of opencl_scheduling_profile=very fast GPU) defines priorities on how (multiple) OpenCL devices are allocated to the different types of pixelpipe (full, preview, export, thumbnail, preview2). for more details visit our usermanual (needs a restart).
  • opencl_disable_drivers_blacklist: not all opencl implementations are fully-functional, there are at least two, which are not production-quality yet: pocl and beignet; therefore at this point in time, it is better to explicitly blacklist them by default. can be changed by setting this configuration option to TRUE (needs a restart).
  • opencl_library: OpenCL runtime library is normally detected automatically by darktable. if your OpenCL runtime is at an unusual place and cannot be detected, enter the full pathname here. leave empty for default behavior.
  • opencl_mandatory_timeout: time period (in units of 5ms) after which we give up try-locking an opencl device for mandatory use. defaults to 200.
  • opencl_memory_headroom: this amount of memory (in MB) will be subtracted from total GPU memory in order to calculate the available OpenCL memory. too low values will lead to out-of-memory situations in OpenCL processing. too high values will lead to unnecessary tiling (needs a restart).
  • opencl_memory_requirement: OpenCL will only be activated if your graphics card has at least this amount of memory. reducing the value will allow cards with less GPU memory to be used - but at the risk of lower system stability and occasional crashes. values below 200 will be treated as 200.
  • opencl_micro_nap: for slow GPUs this gives your graphics driver some time to breathe to do needed screen updates. can be left at zero for fast devices.
  • opencl_number_event_handles: a positive non-zero integer defines the number of event handles that darktable may have opened on a device. a value of -1 does not pose any restrictions, bearing the risk of hitting the device’s resource limits. a value of zero completely prevents the use of event handles.
  • opencl_scheduling_profile: defines how preview and full pixelpipe tasks are scheduled on OpenCL enabled systems. default - GPU processes full and CPU processes preview pipe (adaptable by config parameters); multiple GPUs - process both pixelpipes in parallel on two different GPUs; very fast GPU - process both pixelpipes sequentially on the GPU.
  • opencl_size_roundup: in OpenCL processing round width/height of global work groups to a multiple of this value. reasonable values are powers of 2. this parameter can have high impact on OpenCL performance.
  • opencl_synch_cache: active module (default) - cache the input to the currently focused module, which allows for faster response time when making multiple adjustments to that module (though the whole pipeline may need to be reprocessed when another module is changed); true - cache the output after each module, which may improve speed, as the whole pixelpipe won’t be reprocessed on every parameter change, though will require more memory transfers from the GPU; false - do not sync the pixelpipe cache from OpenCL, which avoids memory transfers from GPUs fast enough to smoothly reprocess the whole pixelpipe.
  • opencl_use_cpu_devices: typically darktable’s hand-optimized CPU code is much faster than any OpenCL-on-CPU code; therefore CPUs are excluded from being used as OpenCL devices by default. can be changed by setting this configuration option to TRUE (needs a restart).
  • opencl_use_pinned_memory: during tiling huge amounts of memory need to be transferred between host and device. for some OpenCL implementations direct memory transfers give a drastic performance penalty. this can often be avoided by using indirect transfers via pinned memory. other devices have more efficient direct memory transfer implementations. AMD seems to belong to the first group, nvidia to the second.

Non-default values in my config:

  • opencl_async_pixelpipe=true (instead of false) - in order to improve latency; I see no issues mentioned in the description
  • opencl_memory_headroom=800 (instead of 400) - for me, OpenCL was not working reliably with less
  • opencl_micro_nap=0 (instead of 1000) - since the docs say ‘can be left at zero for fast devices’
  • opencl_number_event_handles=1000 (instead of 25) - more is probably better, I guess :slight_smile: I see no issues that would indicate ‘hitting the device’s resource limits’
  • opencl_scheduling_profile=very fast GPU (instead of default, but darktable tries to find out the correct value by doing a very simple measurement at start-up, if this value is not present in the config – the measurement is not very reliable, though:
5 Likes

You are very kind to spell it all out …I’ll experiment with a few combinations…I wonder it if would ever be possible to write and optimization script…it seems like there could be a lot of hidden performance out there for some… Thanks again…

Following your settings I changed opencl_micro_nap=0 (instead of 1000)

which lead to a significant speedup:
[dev_pixelpipe] took 2,126 secs (0,312 CPU) processed `astrophoto denoise’ on GPU, blended on GPU [export]

and total:
[dev_process_export] pixel pipeline processing took 6,121 secs (8,047 CPU)
(Instead of 8,207 secs :grinning:)

I also changed opencl_number_event_handles=1000 (instead of 25)

[dev_pixelpipe] took 2,120 secs (0,391 CPU) processed `astrophoto denoise’ on GPU, blended on GPU [export]
[dev_process_export] pixel pipeline processing took 6,094 secs (8,266 CPU)

Not much improvement here, but also a bit better.

All the other parameters were already set like yours.

Thanks a lot !

Changing this from 400 to 800 gave a massive improvement on my PC.

The total has gone from around 23 secs (62 CPU) to 10 secs (16 CPU).

Trying 1200 made it slower at around 12 secs (20 CPU).

The other changes made little if any difference.

Thank you all for posting as this really helps.

The headroom is the amount which darktable has to subtract from the memory available on the card, because it’s potentially used by other apps. As far as I know, there is no standard way to ask how much free memory that card has. If the value is too low, darktable will run into problems when trying to allocate memory on the video card; if it’s too high, darktable will not use the memory optimally.

See also: Which benchmarks provide an estimate to enable me compare and decide on which GPU to buy, for Image Processing ? - #13 by kofa

1 Like

I have the following setup:

  • AMD Ryzen 7 1700 (8 cores)
  • AMD Radeon RX 5700 XT 8 GB
  • 32GB RAM
  • Ubuntu 20.4.3
  • amdgpu driver version 21.40.1.40501 (installed via amdgpu-install --opencl=rocr --vulkan=pro)
  • darktable built from master via plain build.sh command (version 3.7.0+1656~gb9fecd246)

with OpenCL:

12,347238 [dev] took 0,181 secs (0,191 CPU) to load the image.
12,464653 [export] creating pixelpipe took 0,112 secs (0,796 CPU)
12,464729 [pixelpipe_process] [export] using device 0
12,469442 [dev_pixelpipe] took 0,004 secs (0,012 CPU) initing base buffer [export]
12,484404 [dev_pixelpipe] took 0,015 secs (0,044 CPU) processed `raw black/white point' on GPU, blended on GPU [export]
12,486257 [dev_pixelpipe] took 0,002 secs (0,001 CPU) processed `white balance' on GPU, blended on GPU [export]
12,618171 [dev_pixelpipe] took 0,132 secs (1,099 CPU) processed `highlight reconstruction' on CPU, blended on CPU [export]
12,800475 [dev_pixelpipe] took 0,182 secs (2,410 CPU) processed `demosaic' on CPU, blended on CPU [export]
13,206668 [dev_pixelpipe] took 0,406 secs (4,031 CPU) processed `tone mapping' on CPU, blended on CPU [export]
13,468421 [dev_pixelpipe] took 0,262 secs (0,631 CPU) processed `lens correction' on GPU, blended on GPU [export]
13,474133 [dev_pixelpipe] took 0,006 secs (0,002 CPU) processed `base curve' on GPU, blended on GPU [export]
13,482080 [dev_pixelpipe] took 0,008 secs (0,001 CPU) processed `input color profile' on GPU, blended on GPU [export]
13,533536 [dev_pixelpipe] took 0,051 secs (0,000 CPU) processed `color reconstruction' on GPU, blended on GPU [export]
14,517677 [dev_pixelpipe] took 0,984 secs (0,055 CPU) processed `astrophoto denoise' on GPU, blended on GPU [export]
14,580533 [dev_pixelpipe] took 0,063 secs (0,000 CPU) processed `global tonemap' on GPU, blended on GPU [export]
14,608004 [dev_pixelpipe] took 0,027 secs (0,004 CPU) processed `shadows and highlights' on GPU, blended on GPU [export]
14,677770 [dev_pixelpipe] took 0,070 secs (0,002 CPU) processed `contrast equalizer' on GPU, blended on GPU [export]
14,759108 [dev_pixelpipe] took 0,081 secs (0,001 CPU) processed `local contrast' on GPU, blended on GPU [export]
14,765870 [dev_pixelpipe] took 0,007 secs (0,000 CPU) processed `color zones' on GPU, blended on GPU [export]
14,770281 [dev_pixelpipe] took 0,004 secs (0,001 CPU) processed `levels' on GPU, blended on GPU [export]
14,779443 [dev_pixelpipe] took 0,009 secs (0,001 CPU) processed `sharpen' on GPU, blended on GPU [export]
14,782962 [dev_pixelpipe] took 0,003 secs (0,001 CPU) processed `color contrast' on GPU, blended on GPU [export]
14,790542 [dev_pixelpipe] took 0,008 secs (0,000 CPU) processed `output color profile' on GPU, blended on GPU [export]
14,857466 [opencl_profiling] profiling device 0 ('gfx1010:xnack-'):
14,857499 [opencl_profiling] spent  0,0508 seconds in [Write Image (from host to device)]
14,857508 [opencl_profiling] spent  0,0004 seconds in rawprepare_1f
14,857513 [opencl_profiling] spent  0,0004 seconds in whitebalance_1f
14,857518 [opencl_profiling] spent  0,0607 seconds in [Read Image (from device to host)]
14,857522 [opencl_profiling] spent  0,0084 seconds in [Copy Image (on device)]
14,857527 [opencl_profiling] spent  0,0017 seconds in [Write Buffer (from host to device)]
14,857531 [opencl_profiling] spent  0,0093 seconds in lens_distort_bicubic
14,857535 [opencl_profiling] spent  0,0020 seconds in basecurve_legacy_lut
14,857539 [opencl_profiling] spent  0,0019 seconds in colorin_unbound
14,857550 [opencl_profiling] spent  0,0000 seconds in colorreconstruction_zero
14,857555 [opencl_profiling] spent  0,0477 seconds in colorreconstruction_splat
14,857559 [opencl_profiling] spent  0,0000 seconds in [Copy Buffer to Buffer (on device)]
14,857563 [opencl_profiling] spent  0,0000 seconds in colorreconstruction_blur_line
14,857567 [opencl_profiling] spent  0,0018 seconds in colorreconstruction_slice
14,857571 [opencl_profiling] spent  0,0009 seconds in nlmeans_init
14,857575 [opencl_profiling] spent  0,1251 seconds in nlmeans_dist
14,857579 [opencl_profiling] spent  0,0898 seconds in nlmeans_horiz
14,857583 [opencl_profiling] spent  0,2214 seconds in nlmeans_vert
14,857587 [opencl_profiling] spent  0,3587 seconds in nlmeans_accu
14,857591 [opencl_profiling] spent  0,0028 seconds in nlmeans_finish
14,857596 [opencl_profiling] spent  0,0014 seconds in pixelmax_first
14,857600 [opencl_profiling] spent  0,0000 seconds in pixelmax_second
14,857604 [opencl_profiling] spent  0,0000 seconds in [Read Buffer (from device to host)]
14,857608 [opencl_profiling] spent  0,0000 seconds in zero
14,857612 [opencl_profiling] spent  0,0691 seconds in splat
14,857616 [opencl_profiling] spent  0,0016 seconds in global_tonemap_drago
14,857620 [opencl_profiling] spent  0,0001 seconds in blur_line
14,857624 [opencl_profiling] spent  0,0000 seconds in blur_line_z
14,857628 [opencl_profiling] spent  0,0024 seconds in slice_to_output
14,857632 [opencl_profiling] spent  0,0019 seconds in [Copy Image to Buffer (on device)]
14,857636 [opencl_profiling] spent  0,0111 seconds in gaussian_column_4c
14,857640 [opencl_profiling] spent  0,0039 seconds in gaussian_transpose_4c
14,857643 [opencl_profiling] spent  0,0017 seconds in [Copy Buffer to Image (on device)]
14,857647 [opencl_profiling] spent  0,0029 seconds in shadows_highlights_mix
14,857651 [opencl_profiling] spent  0,0425 seconds in eaw_decompose
14,857655 [opencl_profiling] spent  0,0197 seconds in eaw_synthesize
14,857659 [opencl_profiling] spent  0,0016 seconds in slice
14,857663 [opencl_profiling] spent  0,0021 seconds in colorzones_v3
14,857667 [opencl_profiling] spent  0,0018 seconds in levels
14,857670 [opencl_profiling] spent  0,0021 seconds in sharpen_hblur
14,857674 [opencl_profiling] spent  0,0026 seconds in sharpen_vblur
14,857678 [opencl_profiling] spent  0,0025 seconds in sharpen_mix
14,857682 [opencl_profiling] spent  0,0019 seconds in colorcontrast
14,857686 [opencl_profiling] spent  0,0028 seconds in colorout
14,857690 [opencl_profiling] spent  1,1593 seconds totally in command queue (with 0 events missing)
14,857717 [dev_process_export] pixel pipeline processing took 2,393 secs (8,313 CPU)

without OpenCL:

0,838459 [dev] took 0,177 secs (0,194 CPU) to load the image.
0,950538 [export] creating pixelpipe took 0,107 secs (0,699 CPU)
0,950609 [pixelpipe_process] [export] using device -1
0,959993 [dev_pixelpipe] took 0,004 secs (0,041 CPU) initing base buffer [export]
0,975859 [dev_pixelpipe] took 0,016 secs (0,155 CPU) processed `raw black/white point' on CPU, blended on CPU [export]
0,984521 [dev_pixelpipe] took 0,008 secs (0,072 CPU) processed `white balance' on CPU, blended on CPU [export]
1,059209 [dev_pixelpipe] took 0,074 secs (1,073 CPU) processed `highlight reconstruction' on CPU, blended on CPU [export]
1,253666 [dev_pixelpipe] took 0,194 secs (2,309 CPU) processed `demosaic' on CPU, blended on CPU [export]
1,668573 [dev_pixelpipe] took 0,415 secs (4,023 CPU) processed `tone mapping' on CPU, blended on CPU [export]
2,161536 [dev_pixelpipe] took 0,493 secs (5,837 CPU) processed `lens correction' on CPU, blended on CPU [export]
2,201157 [dev_pixelpipe] took 0,040 secs (0,543 CPU) processed `base curve' on CPU, blended on CPU [export]
2,228603 [dev_pixelpipe] took 0,027 secs (0,399 CPU) processed `input color profile' on CPU, blended on CPU [export]
2,375968 [dev_pixelpipe] took 0,147 secs (1,993 CPU) processed `color reconstruction' on CPU, blended on CPU [export]
5,527111 [dev_pixelpipe] took 3,151 secs (46,505 CPU) processed `astrophoto denoise' on CPU, blended on CPU [export]
5,681005 [dev_pixelpipe] took 0,154 secs (2,268 CPU) processed `global tonemap' on CPU, blended on CPU [export]
5,987418 [dev_pixelpipe] took 0,306 secs (4,148 CPU) processed `shadows and highlights' on CPU, blended on CPU [export]
8,218073 [dev_pixelpipe] took 2,230 secs (30,822 CPU) processed `contrast equalizer' on CPU, blended on CPU [export]
8,305542 [dev_pixelpipe] took 0,087 secs (1,315 CPU) processed `local contrast' on CPU, blended on CPU [export]
8,423419 [dev_pixelpipe] took 0,118 secs (1,730 CPU) processed `color zones' on CPU, blended on CPU [export]
8,456511 [dev_pixelpipe] took 0,033 secs (0,492 CPU) processed `levels' on CPU, blended on CPU [export]
8,503149 [dev_pixelpipe] took 0,047 secs (0,716 CPU) processed `sharpen' on CPU, blended on CPU [export]
8,525677 [dev_pixelpipe] took 0,022 secs (0,313 CPU) processed `color contrast' on CPU, blended on CPU [export]
8,579389 [dev_pixelpipe] took 0,054 secs (0,860 CPU) processed `output color profile' on CPU, blended on CPU [export]
8,579426 [dev_process_export] pixel pipeline processing took 7,629 secs (105,696 CPU)

So do I read these correctly.
.for your runs the pipeline is about 3x faster with Opencl but slower to export ??

where do you get this information from? :thinking:

I think you’re misreading the data.

Hold my hand a bit longer…what does the last line mean…I thought it gave the Number of seconds to export the file and the number to process the pipeline ??

14.9 for export 2.939 pipeline with open CL

8.6 for export and 7.6 for pipeline with no Opencl…

Clearly I don’t understand the numbers I guess

As I understand it, in
pixel pipeline processing took 7,629 secs (105,696 CPU)
the first value is the time spent according to the “wall clock”, i.e. the user had to wait 7.63s here;
the second value is the accumulated time each involved hyperthread spent (the CPU @da-phil uses has 8 cores with 2 threads/core).
All the lines with 2 values should be read like that.

So openCL ran about 3x faster, and used less than a tenth of the CPU time compared to the run without openCL (where of course the CPU had to do all the work). Even when using openCL the CPU still has to do some work, like preparing the transfers to and from the GPU.

And some steps are still done on the CPU, even with openCL enabled (“highlight reconstruction”, “demosaic” and “tone mapping”).

THere’s one other caveat: this test image is done with the display-referred pipeline. I have no idea about the gains of openCL in the scene-referred pipeline.

1 Like

Thanks …sorry to be thick…what is the first number (almost 2x faster with no OPENCL)… and I am not sure I understand the CPU number is it cycles or some unit of time??

eg

8,579426 [dev_process_export] pixel pipeline processing took 7,629 secs (105,696 CPU)

what is the 8.579246?? and the 105,696??

The first is just a timestamp (time since starting darktable, or something like that).

The second pair is “wall clock” vs total CPU time. So, if you ran 10 threads for 50 ms, that would be 50 (500).

8,579 etc. is total elapsed time since the start of the run (including loading the file etc.), and the 105,696 is the total time summed over all hyperthreads (so each thread spent on average 6.6 s)

1 Like

Okay I understand now …thanks for taking the time to explain. I appreciate its…so then if the time is say 8.5 odd seconds from the start in CPU and 14 and some odd with OpenCL enabled…what is that extra time and is that a performance metric…its funny I thought that it reflected the time to process and export the file specified in the command line… :slight_smile:

With OpenCL:

Processing started at 12.347238 seconds of ‘process time’ (time since starting darktable); it ended at 14.857717. The difference is 2.510479 → that’s close to the ‘processing took 2.393’ in the log (there’s also some overhead). During this time, some of the processing was more or less parallelised. The total CPU resource consumed was 8.313 seconds. Average number of active threads: 8.313/2.393=3.47.
Think of that as ‘date’, ‘calendar days’ and ‘man days’ on a project.
Each log line starts with when something was logged; then it tells you how much time (as seen by the user) was spent in that step; finally, what was the resource (CPU) consumption / the effort made.

Same for without OpenCL:

0,838459 [dev] took 0,177 secs (0,194 CPU) to load the image.
...
8,579426 [dev_process_export] pixel pipeline processing took 7,629 secs (105,696 CPU)

Start: 0.838 seconds of ‘process time’; end: 8.579. Difference: 7.741 seconds (close to the 7.629 printed). The CPU had a lot more threads running this time: 105.696/7.629 = 13.85.

1 Like

And, if you run darktable (not the -cli version for processing only), you’ll see that the first log entries are close to 0, then if you don’t do anything for a while and then perform an operation, you’ll see large gaps (when both you and the program were idling):

kofa@eagle:~$ darktable  -d opencl -d perf
0.224779 [opencl_init] opencl related configuration options:
0.224792 [opencl_init] 
0.224793 [opencl_init] opencl: 1
... loading stuff during startup ...
0.456995 [opencl_synchronization_timeout] synchronization timeout set to 0
... building UI and loading stuff from the DB to populate the lighttable, I think ...
... processing of thumbnails starting...
0.838314 [opencl_create_kernel] successfully loaded kernel `shadows_highlights_mix' (51) for device 0
2.009077 [dev_process_thumbnail] pixel pipeline processing took 0.094 secs (0.233 CPU)
19.904589 [dev] took 0.001 secs (0.000 CPU) to load the image.

Between the bottom 2 lines (2.009 and 19.904), I used the UI to select an image to load. Since I had selected logging for performance and OpenCL only, file I/O and DB updates were omitted from the log; time seems to ‘jump’ in the log output.

There are several reasons why the OpenCL run could have used so much extra start-up time.

Linux caches file operations, so if a file is asked for several times in a short period, only the first request actually involves loading the file from disk, subsequent reads are in-memory. And disk operations are orders of magnitude slower than memory operations. That in itself could explain the difference in startup between the two runs.

In addition, openCL uses kernels (sub-programs specific to your card/driver combination), that have to be compiled once, and are then cached on disk. If a recompile is needed for any reason (like a driver update), that will also take startup time.

2 Likes