OpenCL analysis... Darktable... much faster with Opencl disabled...something wrong??

And if you compile yourself, make sure to create a ‘release build’.
--build-type Release instead of RelWithDebInfo.

This can have a huge performance impact (I’m not talking about 20%, but rather doubling the speed):

1 Like

Hardware: Ryzen 5 5600X, release build, NVidia 1060/6GB

With OpenCL:

kofa@eagle:/tmp$ rm -rf ~/.cache/darktable/mipmaps-6ac0672dac6fe81d5e75505cc7fa15bfeed1acf8.d/ ; ~/darktable-master/bin/darktable-cli bench.SRW bench.SRW.xmp output.pfm --width 0 --height 0 --hq 1 --core -d perf
0.588920 [dev] took 0.101 secs (0.091 CPU) to load the image.
[_dev_auto_apply_presets] missing mandatory module rawprepare for image 1
[_dev_auto_apply_presets] missing mandatory module colorin for image 1
[_dev_auto_apply_presets] missing mandatory module colorout for image 1
[_dev_auto_apply_presets] missing mandatory module gamma for image 1
0.631202 [export] creating pixelpipe took 0.038 secs (0.372 CPU)
0.635601 [dev_pixelpipe] took 0.004 secs (0.015 CPU) initing base buffer [export]
0.642216 [dev_pixelpipe] took 0.007 secs (0.062 CPU) processed `raw black/white point' on GPU, blended on GPU [export]
0.644776 [dev_pixelpipe] took 0.003 secs (0.000 CPU) processed `white balance' on GPU, blended on GPU [export]
0.708916 [dev_pixelpipe] took 0.064 secs (0.468 CPU) processed `highlight reconstruction' on CPU, blended on CPU [export]
0.797900 [dev_pixelpipe] took 0.089 secs (0.986 CPU) processed `demosaic' on CPU, blended on CPU [export]
1.063531 [dev_pixelpipe] took 0.266 secs (2.177 CPU) processed `tone mapping' on CPU, blended on CPU [export]
1.218839 [dev_pixelpipe] took 0.155 secs (0.471 CPU) processed `lens correction' on GPU, blended on GPU [export]
1.233897 [dev_pixelpipe] took 0.015 secs (0.007 CPU) processed `base curve' on GPU, blended on GPU [export]
1.249046 [dev_pixelpipe] took 0.015 secs (0.007 CPU) processed `input color profile' on GPU, blended on GPU [export]
1.265584 [dev_pixelpipe] took 0.017 secs (0.012 CPU) processed `color reconstruction' on GPU, blended on GPU [export]
3.329972 [dev_pixelpipe] took 2.064 secs (2.043 CPU) processed `astrophoto denoise' on GPU, blended on GPU [export]
3.372937 [dev_pixelpipe] took 0.043 secs (0.031 CPU) processed `global tonemap' on GPU, blended on GPU [export]
3.448512 [dev_pixelpipe] took 0.076 secs (0.056 CPU) processed `shadows and highlights' on GPU, blended on GPU [export]
3.788413 [dev_pixelpipe] took 0.340 secs (0.296 CPU) processed `contrast equalizer' on GPU, blended on GPU [export]
3.812484 [dev_pixelpipe] took 0.024 secs (0.016 CPU) processed `local contrast' on GPU, blended on GPU [export]
3.838831 [dev_pixelpipe] took 0.026 secs (0.015 CPU) processed `color zones' on GPU, blended on GPU [export]
3.851727 [dev_pixelpipe] took 0.013 secs (0.009 CPU) processed `levels' on GPU, blended on GPU [export]
3.879112 [dev_pixelpipe] took 0.027 secs (0.015 CPU) processed `sharpen' on GPU, blended on GPU [export]
3.891835 [dev_pixelpipe] took 0.013 secs (0.009 CPU) processed `color contrast' on GPU, blended on GPU [export]
4.252303 [dev_pixelpipe] took 0.360 secs (3.890 CPU) processed `output color profile' on CPU, blended on CPU [export]
4.252455 [dev_process_export] pixel pipeline processing took 3.621 secs (10.586 CPU)
[export_job] exported to `output.pfm'

Without OpenCL:

kofa@eagle:/tmp$ rm -rf ~/.cache/darktable/mipmaps-6ac0672dac6fe81d5e75505cc7fa15bfeed1acf8.d/ ; ~/darktable-master/bin/darktable-cli bench.SRW bench.SRW.xmp output.pfm --width 0 --height 0 --hq 1 --core -d perf --disable-opencl
0.488711 [dev] took 0.101 secs (0.110 CPU) to load the image.
[_dev_auto_apply_presets] missing mandatory module rawprepare for image 1
[_dev_auto_apply_presets] missing mandatory module colorin for image 1
[_dev_auto_apply_presets] missing mandatory module colorout for image 1
[_dev_auto_apply_presets] missing mandatory module gamma for image 1
0.531560 [export] creating pixelpipe took 0.039 secs (0.386 CPU)
0.535819 [dev_pixelpipe] took 0.004 secs (0.013 CPU) initing base buffer [export]
0.542798 [dev_pixelpipe] took 0.007 secs (0.038 CPU) processed `raw black/white point' on CPU, blended on CPU [export]
0.550401 [dev_pixelpipe] took 0.008 secs (0.065 CPU) processed `white balance' on CPU, blended on CPU [export]
0.592788 [dev_pixelpipe] took 0.042 secs (0.515 CPU) processed `highlight reconstruction' on CPU, blended on CPU [export]
0.682106 [dev_pixelpipe] took 0.089 secs (0.962 CPU) processed `demosaic' on CPU, blended on CPU [export]
0.947667 [dev_pixelpipe] took 0.265 secs (2.104 CPU) processed `tone mapping' on CPU, blended on CPU [export]
1.296137 [dev_pixelpipe] took 0.348 secs (3.047 CPU) processed `lens correction' on CPU, blended on CPU [export]
1.333320 [dev_pixelpipe] took 0.037 secs (0.426 CPU) processed `base curve' on CPU, blended on CPU [export]
1.357736 [dev_pixelpipe] took 0.024 secs (0.280 CPU) processed `input color profile' on CPU, blended on CPU [export]
1.429356 [dev_pixelpipe] took 0.072 secs (0.831 CPU) processed `color reconstruction' on CPU, blended on CPU [export]
3.174414 [dev_pixelpipe] took 1.745 secs (20.621 CPU) processed `astrophoto denoise' on CPU, blended on CPU [export]
3.285510 [dev_pixelpipe] took 0.111 secs (1.312 CPU) processed `global tonemap' on CPU, blended on CPU [export]
3.489421 [dev_pixelpipe] took 0.204 secs (2.201 CPU) processed `shadows and highlights' on CPU, blended on CPU [export]
5.746928 [dev_pixelpipe] took 2.257 secs (25.478 CPU) processed `contrast equalizer' on CPU, blended on CPU [export]
5.807564 [dev_pixelpipe] took 0.061 secs (0.686 CPU) processed `local contrast' on CPU, blended on CPU [export]
5.890651 [dev_pixelpipe] took 0.083 secs (0.983 CPU) processed `color zones' on CPU, blended on CPU [export]
5.926817 [dev_pixelpipe] took 0.036 secs (0.409 CPU) processed `levels' on CPU, blended on CPU [export]
5.966252 [dev_pixelpipe] took 0.039 secs (0.456 CPU) processed `sharpen' on CPU, blended on CPU [export]
5.989733 [dev_pixelpipe] took 0.023 secs (0.268 CPU) processed `color contrast' on CPU, blended on CPU [export]
6.314755 [dev_pixelpipe] took 0.325 secs (3.870 CPU) processed `output color profile' on CPU, blended on CPU [export]
6.314858 [dev_process_export] pixel pipeline processing took 5.783 secs (64.578 CPU)
[export_job] exported to `output.pfm'

It’s interesting to see identical cards (1060/6GB) performing so differently. Release build vs. RelWithDebInfo should not affect the GPU, AFAIK. Worse GPU time and better CPU time for the same module.

Maybe the reason is the CPU? Mine is a i5-4570 (introduction in 2013), yours is a Ryzen 5 5600X (introduction in 2021). The CPU obvioulsy takes a larger share of the computing (i5: 0,500 CPU, R5: 2.043 CPU).

Ya I had made that change after following that thread. I had just been using build.sh to build DT and I think it was set to create RelWithDebInfo version of DT…Did you tweak any of your other OPENCL settings other than the headroom…

The complete set is:

opencl=TRUE
opencl_async_pixelpipe=true
opencl_avoid_atomics=false
opencl_checksum=3316763402
opencl_device_priority=+*/+*/+*/+*/+*
opencl_disable_drivers_blacklist=false
opencl_library=
opencl_mandatory_timeout=200
opencl_memory_headroom=800
opencl_memory_requirement=768
opencl_micro_nap=0
opencl_number_event_handles=1000
opencl_scheduling_profile=very fast GPU
opencl_size_roundup=16
opencl_synch_cache=active module
opencl_use_cpu_devices=false
opencl_use_pinned_memory=false

Descriptions:

  • opencl_async_pixelpipe: if set to TRUE OpenCL pixelpipe will not be synchronized on a per-module basis. this can improve pixelpipe latency. however, potential OpenCL errors would be detected late; in such a case the complete pixelpipe needs to be reprocessed instead of only a single module. export pixelpipe will always be run synchronously.
  • opencl_avoid_atomics: if set to TRUE darktable will not use OpenCL kernels which contain atomic operations (example bilateral). pixelpipe processing will be done on CPU for the affected modules. useful if your OpenCL implementation freezes/crashes on atomics or if they are processed with a bad performance.
  • opencl_checksum: darktable re-checks the performance benchmarks of your system in case your setup has changed, which is indicated by a change versus the stored checksum in this config variable; darktable de-activates opencl if the GPU benchmark lies below the one of the CPU; initial value is the empty string; set to OFF if you want to deactivate any automatic checks and prefer to do all configurations manually.
  • opencl_device_priority: (ignored because of opencl_scheduling_profile=very fast GPU) defines priorities on how (multiple) OpenCL devices are allocated to the different types of pixelpipe (full, preview, export, thumbnail, preview2). for more details visit our usermanual (needs a restart).
  • opencl_disable_drivers_blacklist: not all opencl implementations are fully-functional, there are at least two, which are not production-quality yet: pocl and beignet; therefore at this point in time, it is better to explicitly blacklist them by default. can be changed by setting this configuration option to TRUE (needs a restart).
  • opencl_library: OpenCL runtime library is normally detected automatically by darktable. if your OpenCL runtime is at an unusual place and cannot be detected, enter the full pathname here. leave empty for default behavior.
  • opencl_mandatory_timeout: time period (in units of 5ms) after which we give up try-locking an opencl device for mandatory use. defaults to 200.
  • opencl_memory_headroom: this amount of memory (in MB) will be subtracted from total GPU memory in order to calculate the available OpenCL memory. too low values will lead to out-of-memory situations in OpenCL processing. too high values will lead to unnecessary tiling (needs a restart).
  • opencl_memory_requirement: OpenCL will only be activated if your graphics card has at least this amount of memory. reducing the value will allow cards with less GPU memory to be used - but at the risk of lower system stability and occasional crashes. values below 200 will be treated as 200.
  • opencl_micro_nap: for slow GPUs this gives your graphics driver some time to breathe to do needed screen updates. can be left at zero for fast devices.
  • opencl_number_event_handles: a positive non-zero integer defines the number of event handles that darktable may have opened on a device. a value of -1 does not pose any restrictions, bearing the risk of hitting the device’s resource limits. a value of zero completely prevents the use of event handles.
  • opencl_scheduling_profile: defines how preview and full pixelpipe tasks are scheduled on OpenCL enabled systems. default - GPU processes full and CPU processes preview pipe (adaptable by config parameters); multiple GPUs - process both pixelpipes in parallel on two different GPUs; very fast GPU - process both pixelpipes sequentially on the GPU.
  • opencl_size_roundup: in OpenCL processing round width/height of global work groups to a multiple of this value. reasonable values are powers of 2. this parameter can have high impact on OpenCL performance.
  • opencl_synch_cache: active module (default) - cache the input to the currently focused module, which allows for faster response time when making multiple adjustments to that module (though the whole pipeline may need to be reprocessed when another module is changed); true - cache the output after each module, which may improve speed, as the whole pixelpipe won’t be reprocessed on every parameter change, though will require more memory transfers from the GPU; false - do not sync the pixelpipe cache from OpenCL, which avoids memory transfers from GPUs fast enough to smoothly reprocess the whole pixelpipe.
  • opencl_use_cpu_devices: typically darktable’s hand-optimized CPU code is much faster than any OpenCL-on-CPU code; therefore CPUs are excluded from being used as OpenCL devices by default. can be changed by setting this configuration option to TRUE (needs a restart).
  • opencl_use_pinned_memory: during tiling huge amounts of memory need to be transferred between host and device. for some OpenCL implementations direct memory transfers give a drastic performance penalty. this can often be avoided by using indirect transfers via pinned memory. other devices have more efficient direct memory transfer implementations. AMD seems to belong to the first group, nvidia to the second.

Non-default values in my config:

  • opencl_async_pixelpipe=true (instead of false) - in order to improve latency; I see no issues mentioned in the description
  • opencl_memory_headroom=800 (instead of 400) - for me, OpenCL was not working reliably with less
  • opencl_micro_nap=0 (instead of 1000) - since the docs say ‘can be left at zero for fast devices’
  • opencl_number_event_handles=1000 (instead of 25) - more is probably better, I guess :slight_smile: I see no issues that would indicate ‘hitting the device’s resource limits’
  • opencl_scheduling_profile=very fast GPU (instead of default, but darktable tries to find out the correct value by doing a very simple measurement at start-up, if this value is not present in the config – the measurement is not very reliable, though:
5 Likes

You are very kind to spell it all out …I’ll experiment with a few combinations…I wonder it if would ever be possible to write and optimization script…it seems like there could be a lot of hidden performance out there for some… Thanks again…

Following your settings I changed opencl_micro_nap=0 (instead of 1000)

which lead to a significant speedup:
[dev_pixelpipe] took 2,126 secs (0,312 CPU) processed `astrophoto denoise’ on GPU, blended on GPU [export]

and total:
[dev_process_export] pixel pipeline processing took 6,121 secs (8,047 CPU)
(Instead of 8,207 secs :grinning:)

I also changed opencl_number_event_handles=1000 (instead of 25)

[dev_pixelpipe] took 2,120 secs (0,391 CPU) processed `astrophoto denoise’ on GPU, blended on GPU [export]
[dev_process_export] pixel pipeline processing took 6,094 secs (8,266 CPU)

Not much improvement here, but also a bit better.

All the other parameters were already set like yours.

Thanks a lot !

Changing this from 400 to 800 gave a massive improvement on my PC.

The total has gone from around 23 secs (62 CPU) to 10 secs (16 CPU).

Trying 1200 made it slower at around 12 secs (20 CPU).

The other changes made little if any difference.

Thank you all for posting as this really helps.

The headroom is the amount which darktable has to subtract from the memory available on the card, because it’s potentially used by other apps. As far as I know, there is no standard way to ask how much free memory that card has. If the value is too low, darktable will run into problems when trying to allocate memory on the video card; if it’s too high, darktable will not use the memory optimally.

See also: Which benchmarks provide an estimate to enable me compare and decide on which GPU to buy, for Image Processing ? - #13 by kofa

1 Like

I have the following setup:

  • AMD Ryzen 7 1700 (8 cores)
  • AMD Radeon RX 5700 XT 8 GB
  • 32GB RAM
  • Ubuntu 20.4.3
  • amdgpu driver version 21.40.1.40501 (installed via amdgpu-install --opencl=rocr --vulkan=pro)
  • darktable built from master via plain build.sh command (version 3.7.0+1656~gb9fecd246)

with OpenCL:

12,347238 [dev] took 0,181 secs (0,191 CPU) to load the image.
12,464653 [export] creating pixelpipe took 0,112 secs (0,796 CPU)
12,464729 [pixelpipe_process] [export] using device 0
12,469442 [dev_pixelpipe] took 0,004 secs (0,012 CPU) initing base buffer [export]
12,484404 [dev_pixelpipe] took 0,015 secs (0,044 CPU) processed `raw black/white point' on GPU, blended on GPU [export]
12,486257 [dev_pixelpipe] took 0,002 secs (0,001 CPU) processed `white balance' on GPU, blended on GPU [export]
12,618171 [dev_pixelpipe] took 0,132 secs (1,099 CPU) processed `highlight reconstruction' on CPU, blended on CPU [export]
12,800475 [dev_pixelpipe] took 0,182 secs (2,410 CPU) processed `demosaic' on CPU, blended on CPU [export]
13,206668 [dev_pixelpipe] took 0,406 secs (4,031 CPU) processed `tone mapping' on CPU, blended on CPU [export]
13,468421 [dev_pixelpipe] took 0,262 secs (0,631 CPU) processed `lens correction' on GPU, blended on GPU [export]
13,474133 [dev_pixelpipe] took 0,006 secs (0,002 CPU) processed `base curve' on GPU, blended on GPU [export]
13,482080 [dev_pixelpipe] took 0,008 secs (0,001 CPU) processed `input color profile' on GPU, blended on GPU [export]
13,533536 [dev_pixelpipe] took 0,051 secs (0,000 CPU) processed `color reconstruction' on GPU, blended on GPU [export]
14,517677 [dev_pixelpipe] took 0,984 secs (0,055 CPU) processed `astrophoto denoise' on GPU, blended on GPU [export]
14,580533 [dev_pixelpipe] took 0,063 secs (0,000 CPU) processed `global tonemap' on GPU, blended on GPU [export]
14,608004 [dev_pixelpipe] took 0,027 secs (0,004 CPU) processed `shadows and highlights' on GPU, blended on GPU [export]
14,677770 [dev_pixelpipe] took 0,070 secs (0,002 CPU) processed `contrast equalizer' on GPU, blended on GPU [export]
14,759108 [dev_pixelpipe] took 0,081 secs (0,001 CPU) processed `local contrast' on GPU, blended on GPU [export]
14,765870 [dev_pixelpipe] took 0,007 secs (0,000 CPU) processed `color zones' on GPU, blended on GPU [export]
14,770281 [dev_pixelpipe] took 0,004 secs (0,001 CPU) processed `levels' on GPU, blended on GPU [export]
14,779443 [dev_pixelpipe] took 0,009 secs (0,001 CPU) processed `sharpen' on GPU, blended on GPU [export]
14,782962 [dev_pixelpipe] took 0,003 secs (0,001 CPU) processed `color contrast' on GPU, blended on GPU [export]
14,790542 [dev_pixelpipe] took 0,008 secs (0,000 CPU) processed `output color profile' on GPU, blended on GPU [export]
14,857466 [opencl_profiling] profiling device 0 ('gfx1010:xnack-'):
14,857499 [opencl_profiling] spent  0,0508 seconds in [Write Image (from host to device)]
14,857508 [opencl_profiling] spent  0,0004 seconds in rawprepare_1f
14,857513 [opencl_profiling] spent  0,0004 seconds in whitebalance_1f
14,857518 [opencl_profiling] spent  0,0607 seconds in [Read Image (from device to host)]
14,857522 [opencl_profiling] spent  0,0084 seconds in [Copy Image (on device)]
14,857527 [opencl_profiling] spent  0,0017 seconds in [Write Buffer (from host to device)]
14,857531 [opencl_profiling] spent  0,0093 seconds in lens_distort_bicubic
14,857535 [opencl_profiling] spent  0,0020 seconds in basecurve_legacy_lut
14,857539 [opencl_profiling] spent  0,0019 seconds in colorin_unbound
14,857550 [opencl_profiling] spent  0,0000 seconds in colorreconstruction_zero
14,857555 [opencl_profiling] spent  0,0477 seconds in colorreconstruction_splat
14,857559 [opencl_profiling] spent  0,0000 seconds in [Copy Buffer to Buffer (on device)]
14,857563 [opencl_profiling] spent  0,0000 seconds in colorreconstruction_blur_line
14,857567 [opencl_profiling] spent  0,0018 seconds in colorreconstruction_slice
14,857571 [opencl_profiling] spent  0,0009 seconds in nlmeans_init
14,857575 [opencl_profiling] spent  0,1251 seconds in nlmeans_dist
14,857579 [opencl_profiling] spent  0,0898 seconds in nlmeans_horiz
14,857583 [opencl_profiling] spent  0,2214 seconds in nlmeans_vert
14,857587 [opencl_profiling] spent  0,3587 seconds in nlmeans_accu
14,857591 [opencl_profiling] spent  0,0028 seconds in nlmeans_finish
14,857596 [opencl_profiling] spent  0,0014 seconds in pixelmax_first
14,857600 [opencl_profiling] spent  0,0000 seconds in pixelmax_second
14,857604 [opencl_profiling] spent  0,0000 seconds in [Read Buffer (from device to host)]
14,857608 [opencl_profiling] spent  0,0000 seconds in zero
14,857612 [opencl_profiling] spent  0,0691 seconds in splat
14,857616 [opencl_profiling] spent  0,0016 seconds in global_tonemap_drago
14,857620 [opencl_profiling] spent  0,0001 seconds in blur_line
14,857624 [opencl_profiling] spent  0,0000 seconds in blur_line_z
14,857628 [opencl_profiling] spent  0,0024 seconds in slice_to_output
14,857632 [opencl_profiling] spent  0,0019 seconds in [Copy Image to Buffer (on device)]
14,857636 [opencl_profiling] spent  0,0111 seconds in gaussian_column_4c
14,857640 [opencl_profiling] spent  0,0039 seconds in gaussian_transpose_4c
14,857643 [opencl_profiling] spent  0,0017 seconds in [Copy Buffer to Image (on device)]
14,857647 [opencl_profiling] spent  0,0029 seconds in shadows_highlights_mix
14,857651 [opencl_profiling] spent  0,0425 seconds in eaw_decompose
14,857655 [opencl_profiling] spent  0,0197 seconds in eaw_synthesize
14,857659 [opencl_profiling] spent  0,0016 seconds in slice
14,857663 [opencl_profiling] spent  0,0021 seconds in colorzones_v3
14,857667 [opencl_profiling] spent  0,0018 seconds in levels
14,857670 [opencl_profiling] spent  0,0021 seconds in sharpen_hblur
14,857674 [opencl_profiling] spent  0,0026 seconds in sharpen_vblur
14,857678 [opencl_profiling] spent  0,0025 seconds in sharpen_mix
14,857682 [opencl_profiling] spent  0,0019 seconds in colorcontrast
14,857686 [opencl_profiling] spent  0,0028 seconds in colorout
14,857690 [opencl_profiling] spent  1,1593 seconds totally in command queue (with 0 events missing)
14,857717 [dev_process_export] pixel pipeline processing took 2,393 secs (8,313 CPU)

without OpenCL:

0,838459 [dev] took 0,177 secs (0,194 CPU) to load the image.
0,950538 [export] creating pixelpipe took 0,107 secs (0,699 CPU)
0,950609 [pixelpipe_process] [export] using device -1
0,959993 [dev_pixelpipe] took 0,004 secs (0,041 CPU) initing base buffer [export]
0,975859 [dev_pixelpipe] took 0,016 secs (0,155 CPU) processed `raw black/white point' on CPU, blended on CPU [export]
0,984521 [dev_pixelpipe] took 0,008 secs (0,072 CPU) processed `white balance' on CPU, blended on CPU [export]
1,059209 [dev_pixelpipe] took 0,074 secs (1,073 CPU) processed `highlight reconstruction' on CPU, blended on CPU [export]
1,253666 [dev_pixelpipe] took 0,194 secs (2,309 CPU) processed `demosaic' on CPU, blended on CPU [export]
1,668573 [dev_pixelpipe] took 0,415 secs (4,023 CPU) processed `tone mapping' on CPU, blended on CPU [export]
2,161536 [dev_pixelpipe] took 0,493 secs (5,837 CPU) processed `lens correction' on CPU, blended on CPU [export]
2,201157 [dev_pixelpipe] took 0,040 secs (0,543 CPU) processed `base curve' on CPU, blended on CPU [export]
2,228603 [dev_pixelpipe] took 0,027 secs (0,399 CPU) processed `input color profile' on CPU, blended on CPU [export]
2,375968 [dev_pixelpipe] took 0,147 secs (1,993 CPU) processed `color reconstruction' on CPU, blended on CPU [export]
5,527111 [dev_pixelpipe] took 3,151 secs (46,505 CPU) processed `astrophoto denoise' on CPU, blended on CPU [export]
5,681005 [dev_pixelpipe] took 0,154 secs (2,268 CPU) processed `global tonemap' on CPU, blended on CPU [export]
5,987418 [dev_pixelpipe] took 0,306 secs (4,148 CPU) processed `shadows and highlights' on CPU, blended on CPU [export]
8,218073 [dev_pixelpipe] took 2,230 secs (30,822 CPU) processed `contrast equalizer' on CPU, blended on CPU [export]
8,305542 [dev_pixelpipe] took 0,087 secs (1,315 CPU) processed `local contrast' on CPU, blended on CPU [export]
8,423419 [dev_pixelpipe] took 0,118 secs (1,730 CPU) processed `color zones' on CPU, blended on CPU [export]
8,456511 [dev_pixelpipe] took 0,033 secs (0,492 CPU) processed `levels' on CPU, blended on CPU [export]
8,503149 [dev_pixelpipe] took 0,047 secs (0,716 CPU) processed `sharpen' on CPU, blended on CPU [export]
8,525677 [dev_pixelpipe] took 0,022 secs (0,313 CPU) processed `color contrast' on CPU, blended on CPU [export]
8,579389 [dev_pixelpipe] took 0,054 secs (0,860 CPU) processed `output color profile' on CPU, blended on CPU [export]
8,579426 [dev_process_export] pixel pipeline processing took 7,629 secs (105,696 CPU)

So do I read these correctly.
.for your runs the pipeline is about 3x faster with Opencl but slower to export ??

where do you get this information from? :thinking:

I think you’re misreading the data.

Hold my hand a bit longer…what does the last line mean…I thought it gave the Number of seconds to export the file and the number to process the pipeline ??

14.9 for export 2.939 pipeline with open CL

8.6 for export and 7.6 for pipeline with no Opencl…

Clearly I don’t understand the numbers I guess

As I understand it, in
pixel pipeline processing took 7,629 secs (105,696 CPU)
the first value is the time spent according to the “wall clock”, i.e. the user had to wait 7.63s here;
the second value is the accumulated time each involved hyperthread spent (the CPU @da-phil uses has 8 cores with 2 threads/core).
All the lines with 2 values should be read like that.

So openCL ran about 3x faster, and used less than a tenth of the CPU time compared to the run without openCL (where of course the CPU had to do all the work). Even when using openCL the CPU still has to do some work, like preparing the transfers to and from the GPU.

And some steps are still done on the CPU, even with openCL enabled (“highlight reconstruction”, “demosaic” and “tone mapping”).

THere’s one other caveat: this test image is done with the display-referred pipeline. I have no idea about the gains of openCL in the scene-referred pipeline.

1 Like

Thanks …sorry to be thick…what is the first number (almost 2x faster with no OPENCL)… and I am not sure I understand the CPU number is it cycles or some unit of time??

eg

8,579426 [dev_process_export] pixel pipeline processing took 7,629 secs (105,696 CPU)

what is the 8.579246?? and the 105,696??

The first is just a timestamp (time since starting darktable, or something like that).

The second pair is “wall clock” vs total CPU time. So, if you ran 10 threads for 50 ms, that would be 50 (500).

8,579 etc. is total elapsed time since the start of the run (including loading the file etc.), and the 105,696 is the total time summed over all hyperthreads (so each thread spent on average 6.6 s)

1 Like