Darktable - Windows performance

Hi,
I went through a number of paid and free RAW management/editing applications, and after years I finally decided to stay with Darktable. Its learning curve was not the easiest, but I really appreciate it now that I have learned the basics. However, I was wondering whether I can improve its performance on my aging hardware (Ryzen 7 1700/16GB RAM/RX 570 8GB). For a start I want to replace the GPU with a more current one. I was thinking about getting either an RX 7600 or a Radeon Pro W6600 (both 8 GB GDDR6), as the latter is now on sale for less than the former. Both show results roughly 2x better than my current card in Compubench OpenCL benchmarks (and ditto for DirectX/OpenGL performance). What is your experience with such an upgrade under Windows (11)? Or does Nvidia perform better than a comparable AMD GPU with Darktable on Windows? I ran some benchmarks on my current system using this known methodology:
GPU benchmarks in darktable (dartmouth.edu)
and my current results are far behind the results I can see on that page. Here they are:
Ryzen 7-1700 only: 16,6 sec
Ryzen 7-1700 + RX 580: 7,7 sec
So not too great, I think. Or maybe the Windows version of Darktable is not as well optimized as the Linux version? What are your thoughts?

The GPU-based code is always built locally, for the specific graphics card. The RX 580 is quite an old card, in the same league as (but slower than) my Nvidia 1060, at least according to Radeon RX 580 vs GeForce GTX 1060 [videocardbenchmark.net] by PassMark Software (but with more memory, it will tile less frequently, so it could also beat my card, depending on the image and the operations).

Thanks, I’ll probably be able to borrow a GTX 1080 for tests, so I’ll see if there’s any real difference between the RX 580, the GTX 1080 and my son’s RX 7600 (but coupled with his Ryzen 5 7600). I’ll post the results here for the curious.

At some point it is cheaper/better to just upgrade the entire system than to add a card to an old one. The first problem could be a power supply that cannot handle the new card.

With a new system you also get faster SSD drives, faster CPU with more cores, more/faster RAM…

FYI, I’ve tested dt in windows and linux (dual boot) and I could not see a performance difference in export times. Hardware and drivers for the hardware are the most important factors.

I concur that it is better to build a whole new system, but I can’t afford it at the moment (I plan to do this later this year), and I thought that starting with a GPU that works in my current system would be a good first step :slight_smile: However, it is good to know that DT under Windows is not lagging behind Linux performance-wise.

I kept my very old Core2 Duo system (built in 2008) operational until 2021 by upgrading the GPU. Since I set darktable to process everything on the GPU, it remained usable (until the lack of RAM on the motherboard started causing OpenCL issues – I only had 4 GB of RAM, the GPU had 6). I replaced the power supply twice, and switched to SSD (still SATA, though). Then I got the new motherboard, RAM and CPU, removed the old ones, put the new ones back in the same case, and fired it up. Linux kept running.

Got the promised GTX 1080 benchmarks: roughly 34% faster than the RX 570, but that was expected. The whole benchmark with OpenCL on the Nvidia card took 5.5 seconds. Anybody willing to test their systems using this simple benchmark? :slight_smile: The benchmark files are available here:
https://math.dartmouth.edu/~sarunas/darktable_bench.html

I don’t know how relevant that is. It uses a pretty old stack.

Stack? It is just a RAW picture and an .XMP with a set of Darktable edits, applied with darktable-cli using either the CPU only or CPU+GPU.

Processing stack, the history in the XMP.

Ah, that. I just looked through the XMP, and it seems many of the modules there are rather current:
hazeremoval
exposure
flip
colorbalancergb
Still a good option to benchmark.

diffuse or sharpen is one that’s missing, and one that’s both heavy-weight and part of my standard toolchain.

OK, I finally got a new Radeon Pro W6600 on sale for less than $250. Not much improvement: my benchmark takes slightly over 6 seconds to complete. And @g-man was right: it is better to upgrade the whole system, as the GPU is just one part of the processing path; the CPU still has its role, and its power should match the GPU's. I’ll do that in a few months; I plan to build a Ryzen 9 7900 based system. Well, at least my new card is now the first element of this future setup, and it is much more energy efficient :slight_smile: Thanks all for the replies!

With OpenCL:
darktable-cli setubal.orf setubal.orf.xmp test.jpg --core -d perf -d opencl: [dev_process_export] pixel pipeline processing took 5.563 secs (13.421 CPU)

Without:
darktable-cli setubal.orf setubal.orf.xmp test.jpg --core -d perf -d opencl --disable-opencl: [dev_process_export] pixel pipeline processing took 13.078 secs (127.521 CPU)
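For anyone who wants to repeat both runs without copying numbers by hand, here is a minimal Python sketch. It only wraps the two command lines shown above and picks out the `pixel pipeline processing took` line; the function names are made up for illustration, and it assumes `darktable-cli` is on the PATH.

```python
import re
import subprocess

# Matches the summary line printed with `-d perf`, e.g.
# "[dev_process_export] pixel pipeline processing took 5.563 secs (13.421 CPU)".
# Both "." and "," decimals are accepted, since logs in this thread use both.
PIPELINE_RE = re.compile(
    r"\[dev_process_export\] pixel pipeline processing took "
    r"(\d+[.,]\d+) secs \((\d+[.,]\d+) CPU\)"
)

def build_command(raw, xmp, out, use_opencl=True):
    """Build the darktable-cli command line used in this thread."""
    cmd = ["darktable-cli", raw, xmp, out, "--core", "-d", "perf", "-d", "opencl"]
    if not use_opencl:
        cmd.append("--disable-opencl")
    return cmd

def parse_pipeline_time(log_text):
    """Return (wall_seconds, cpu_seconds) from the log, or None if absent."""
    match = PIPELINE_RE.search(log_text)
    if match is None:
        return None
    wall, cpu = (float(g.replace(",", ".")) for g in match.groups())
    return wall, cpu

def run_benchmark(raw, xmp, out, use_opencl=True):
    """Run one export and return its (wall, cpu) pipeline time."""
    result = subprocess.run(
        build_command(raw, xmp, out, use_opencl),
        capture_output=True, text=True, check=True,
    )
    # The debug output may land on stdout or stderr, so search both.
    return parse_pipeline_time(result.stdout + result.stderr)
```

Calling `run_benchmark("setubal.orf", "setubal.orf.xmp", "test.jpg")` and then the same with `use_opencl=False` should give the two figures compared above.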

In GPU-compute, your card should be about twice as fast as mine.

                Radeon PRO W6600    GeForce GTX 1060
GPU Compute     9895 Ops/Sec        4322 Ops/Sec (-56.3%)

(Radeon PRO W6600 vs GeForce GTX 1060 [videocardbenchmark.net] by PassMark Software)

My OpenCL logs
2.2212 [dt_dev_load_raw] loading the image. took 0.587 secs (0.563 CPU)
2.2789 [export] creating pixelpipe took 0.055 secs (0.398 CPU)
2.2790 [dt_opencl_check_tuning] use 4808MB (headroom=OFF, pinning=OFF) on device `NVIDIA CUDA NVIDIA GeForce GTX 1060 6GB' id=0
2.2793 [dev_pixelpipe] took 0.000 secs (0.000 CPU) initing base buffer [export]
2.2934 [dev_pixelpipe] took 0.014 secs (0.065 CPU) [export] processed `rawprepare' on GPU, blended on GPU
2.2990 [dev_pixelpipe] took 0.006 secs (0.002 CPU) [export] processed `temperature' on GPU, blended on GPU
2.3266 [dev_pixelpipe] took 0.028 secs (0.023 CPU) [export] processed `highlights' on GPU, blended on GPU
2.4592 [dev_pixelpipe] took 0.133 secs (0.127 CPU) [export] processed `hotpixels' on CPU, blended on CPU
2.5928 [dev_pixelpipe] took 0.134 secs (0.143 CPU) [export] processed `demosaic' on GPU, blended on GPU
3.9984 [dev_pixelpipe] took 1.406 secs (0.866 CPU) [export] processed `denoiseprofile' on GPU with tiling, blended on CPU
4.5705 [dev_pixelpipe] took 0.572 secs (1.564 CPU) [export] processed `lens' on GPU, blended on GPU
4.6047 [dev_pixelpipe] took 0.034 secs (0.029 CPU) [export] processed `ashift' on GPU, blended on GPU
4.6263 [dev_pixelpipe] took 0.022 secs (0.017 CPU) [export] processed `exposure' on GPU, blended on GPU
4.6620 [dev_pixelpipe] took 0.036 secs (0.027 CPU) [export] processed `colorin' on GPU, blended on GPU
4.6827 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_LAB-->IOP_CS_RGB took 0.020 secs (0.013 GPU) [channelmixerrgb]
4.7277 [dev_pixelpipe] took 0.066 secs (0.046 CPU) [export] processed `channelmixerrgb' on GPU, blended on GPU
4.8807 [dt_ioppr_transform_image_colorspace] IOP_CS_RGB-->IOP_CS_LAB took 0.064 secs (0.657 CPU) [atrous]
5.9407 [dev_pixelpipe] took 1.213 secs (1.770 CPU) [export] processed `atrous' on GPU with tiling, blended on CPU
6.0632 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_LAB-->IOP_CS_RGB took 0.012 secs (0.012 GPU) [colorbalancergb]
6.1078 [dev_pixelpipe] took 0.167 secs (0.146 CPU) [export] processed `colorbalancergb' on GPU, blended on GPU
6.1423 [dev_pixelpipe] took 0.034 secs (0.022 CPU) [export] processed `rgblevels' on GPU, blended on GPU
6.1713 [dev_pixelpipe] took 0.029 secs (0.020 CPU) [export] processed `sigmoid' on GPU, blended on GPU
6.3214 [dt_ioppr_transform_image_colorspace] IOP_CS_RGB-->IOP_CS_LAB took 0.059 secs (0.645 CPU) [bilat]
7.6203 [dev_pixelpipe] took 1.449 secs (8.338 CPU) [export] processed `bilat' on CPU, blended on CPU
7.7289 [dev_pixelpipe] took 0.108 secs (0.108 CPU) [export] processed `colorout' on GPU, blended on GPU
7.7331 [resample_cl] took 0.004 secs (0.000 CPU) 1:1 copy/crop of 8065x6046 pixels
7.7505 [dev_pixelpipe] took 0.022 secs (0.017 CPU) [export] processed `finalscale' on GPU, blended on GPU
7.8418 [opencl_profiling] profiling device 0 ('NVIDIA CUDA NVIDIA GeForce GTX 1060 6GB'):
7.8418 [opencl_profiling] spent  0.5348 seconds in [Write Image (from host to device)]
7.8418 [opencl_profiling] spent  0.0026 seconds in rawprepare_1f
7.8418 [opencl_profiling] spent  0.0031 seconds in whitebalance_1f
7.8418 [opencl_profiling] spent  0.0025 seconds in highlights_initmask
7.8418 [opencl_profiling] spent  0.0033 seconds in highlights_dilatemask
7.8418 [opencl_profiling] spent  0.1928 seconds in [Write Buffer (from host to device)]
7.8418 [opencl_profiling] spent  0.0075 seconds in highlights_chroma
7.8418 [opencl_profiling] spent  0.0000 seconds in [Read Buffer (from device to host)]
7.8418 [opencl_profiling] spent  0.0063 seconds in highlights_opposed
7.8418 [opencl_profiling] spent  1.0297 seconds in [Read Image (from device to host)]
7.8418 [opencl_profiling] spent  0.0008 seconds in border_interpolate
7.8418 [opencl_profiling] spent  0.0060 seconds in rcd_border_green
7.8418 [opencl_profiling] spent  0.0107 seconds in rcd_border_redblue
7.8418 [opencl_profiling] spent  0.0074 seconds in rcd_populate
7.8418 [opencl_profiling] spent  0.0052 seconds in rcd_step_1_1
7.8418 [opencl_profiling] spent  0.0040 seconds in rcd_step_1_2
7.8418 [opencl_profiling] spent  0.0025 seconds in rcd_step_2_1
7.8418 [opencl_profiling] spent  0.0065 seconds in rcd_step_3_1
7.8418 [opencl_profiling] spent  0.0037 seconds in rcd_step_4_1
7.8418 [opencl_profiling] spent  0.0020 seconds in rcd_step_4_2
7.8418 [opencl_profiling] spent  0.0058 seconds in rcd_step_5_1
7.8418 [opencl_profiling] spent  0.0093 seconds in rcd_step_5_2
7.8419 [opencl_profiling] spent  0.0099 seconds in rcd_write_output
7.8419 [opencl_profiling] spent  0.0118 seconds in denoiseprofile_precondition_Y0U0V0
7.8419 [opencl_profiling] spent  0.4297 seconds in denoiseprofile_decompose
7.8419 [opencl_profiling] spent  0.0418 seconds in denoiseprofile_reduce_first
7.8419 [opencl_profiling] spent  0.0002 seconds in denoiseprofile_reduce_second
7.8419 [opencl_profiling] spent  0.1217 seconds in denoiseprofile_synthesize
7.8419 [opencl_profiling] spent  0.0659 seconds in [Copy Image (on device)]
7.8419 [opencl_profiling] spent  0.0119 seconds in denoiseprofile_backtransform_Y0U0V0
7.8419 [opencl_profiling] spent  0.0176 seconds in lens_vignette
7.8419 [opencl_profiling] spent  0.0550 seconds in lens_distort_bicubic
7.8419 [opencl_profiling] spent  0.0261 seconds in ashift_bicubic
7.8419 [opencl_profiling] spent  0.0169 seconds in exposure
7.8419 [opencl_profiling] spent  0.0191 seconds in colorin_unbound
7.8419 [opencl_profiling] spent  0.0269 seconds in colorspaces_transform_lab_to_rgb_matrix
7.8419 [opencl_profiling] spent  0.0150 seconds in channelmixerrgb_CAT16
7.8419 [opencl_profiling] spent  0.6065 seconds in eaw_decompose
7.8419 [opencl_profiling] spent  0.1499 seconds in eaw_synthesize
7.8419 [opencl_profiling] spent  0.0180 seconds in colorbalancergb
7.8419 [opencl_profiling] spent  0.0147 seconds in rgblevels
7.8419 [opencl_profiling] spent  0.0215 seconds in sigmoid_loglogistic_per_channel
7.8419 [opencl_profiling] spent  0.0223 seconds in colorout
7.8419 [opencl_profiling] spent  3.5489 seconds totally in command queue (with 0 events missing)
7.8419 [dev_process_export] pixel pipeline processing took 5.563 secs (13.421 CPU)

Did you see excessive tiling, or other issues in your logs? I had a little bit with my GPU (in denoiseprofile → denoise (profiled) and atrous → contrast equalizer). What’s your darktable resource setting? In another benchmark, there was quite a bit of difference (4.4 vs 6 seconds) between large and normal on my machine.
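Tiling shows up in the `[dev_pixelpipe]` lines themselves as `on GPU with tiling`. A small Python sketch for listing it, based only on the line format visible in the logs in this thread (the function names are hypothetical, and other darktable versions may format these lines differently):

```python
import re

# One `[dev_pixelpipe]` line per module, as printed with `-d perf`.
# Accepts "." or "," decimals and both `name' and name’ quoting, because
# pasted logs in this thread show both variants.
MODULE_RE = re.compile(
    r"\[dev_pixelpipe\] took (\d+[.,]\d+) secs \((\d+[.,]\d+) CPU\) "
    r"\[(\w+)\] processed [`‘']?(\w+)['’] on (CPU|GPU)( with tiling)?"
)

def parse_modules(log_text):
    """Return one dict per processed module found in the log text."""
    rows = []
    for m in MODULE_RE.finditer(log_text):
        rows.append({
            "module": m.group(4),
            "seconds": float(m.group(1).replace(",", ".")),
            "device": m.group(5),
            "tiled": m.group(6) is not None,
        })
    return rows

def tiled_modules(log_text):
    """Names of the modules that were processed with tiling."""
    return [row["module"] for row in parse_modules(log_text) if row["tiled"]]
```

Feeding it a saved log gives the modules that tiled, and the `device` field shows which ones fell back to the CPU.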

This is my benchmark log. As you can see, my new GPU is slower than yours for some reason. In all (DirectX and OpenCL) benchmarks it is roughly 2 times faster than a GTX1060 or RX570. I don’t know, maybe I am missing some OpenCL optimizations in darktablerc? Regarding resources, I set “Very fast GPU” and “Use all device memory” (the W6600 has 8 GB RAM). I don’t know where I can check whether there was any tiling during processing. I’ll take a look at the other benchmark you mentioned.

2,4995 [dt_dev_load_raw] loading the image. took 0,759 secs (0,719 CPU)
2,6760 [export] creating pixelpipe took 0,165 secs (0,156 CPU)
2,6762 [dt_opencl_check_tuning] use 7576MB (headroom=ON, pinning=OFF) on device `AMD Accelerated Parallel Processing gfx1032' id=0
2,6774 [dev_pixelpipe] took 0,000 secs (0,000 CPU) initing base buffer [export]
2,7627 [dev_pixelpipe] took 0,085 secs (0,000 CPU) [export] processed `rawprepare' on GPU, blended on GPU
2,7927 [dev_pixelpipe] took 0,030 secs (0,000 CPU) [export] processed `temperature' on GPU, blended on GPU
2,8077 [dev_pixelpipe] took 0,015 secs (0,000 CPU) [export] processed `highlights' on GPU, blended on GPU
2,9170 [dev_pixelpipe] took 0,109 secs (0,016 CPU) [export] processed `hotpixels' on CPU, blended on CPU
3,2210 [dev_pixelpipe] took 0,304 secs (0,000 CPU) [export] processed `demosaic' on GPU, blended on GPU
4,9699 [dev_pixelpipe] took 1,749 secs (0,000 CPU) [export] processed `denoiseprofile' on GPU, blended on GPU
5,9107 [dev_pixelpipe] took 0,941 secs (2,172 CPU) [export] processed `lens' on GPU, blended on GPU
5,9326 [dev_pixelpipe] took 0,022 secs (0,000 CPU) [export] processed `ashift' on GPU, blended on GPU
5,9471 [dev_pixelpipe] took 0,014 secs (0,000 CPU) [export] processed `exposure' on GPU, blended on GPU
5,9641 [dev_pixelpipe] took 0,017 secs (0,000 CPU) [export] processed `colorin' on GPU, blended on GPU
5,9714 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_LAB-->IOP_CS_RGB took 0,003 secs (0,000 GPU) [channelmixerrgb]
6,0511 [dev_pixelpipe] took 0,087 secs (0,000 CPU) [export] processed `channelmixerrgb' on GPU, blended on GPU
6,0648 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_RGB-->IOP_CS_LAB took 0,003 secs (0,000 GPU) [atrous]
7,8978 [dev_pixelpipe] took 1,847 secs (0,000 CPU) [export] processed `atrous' on GPU, blended on GPU
7,9117 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_LAB-->IOP_CS_RGB took 0,003 secs (0,000 GPU) [colorbalancergb]
8,0028 [dev_pixelpipe] took 0,105 secs (0,000 CPU) [export] processed `colorbalancergb' on GPU, blended on GPU
8,0159 [dev_pixelpipe] took 0,013 secs (0,000 CPU) [export] processed `rgblevels' on GPU, blended on GPU
8,0300 [dev_pixelpipe] took 0,014 secs (0,000 CPU) [export] processed `sigmoid' on GPU, blended on GPU
8,0392 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_RGB-->IOP_CS_LAB took 0,003 secs (0,000 GPU) [bilat]
8,7702 [dev_pixelpipe] took 0,740 secs (0,000 CPU) [export] processed `bilat' on GPU, blended on GPU
8,7967 [dev_pixelpipe] took 0,026 secs (0,000 CPU) [export] processed `colorout' on GPU, blended on GPU
8,8014 [resample_cl] took 0,000 secs (0,000 CPU) 1:1 copy/crop of 8065x6046 pixels
8,8115 [dev_pixelpipe] took 0,015 secs (0,000 CPU) [export] processed `finalscale' on GPU, blended on GPU
9,0524 [opencl_profiling] profiling device 0 (‘AMD Accelerated Parallel Processing gfx1032’):
9,0525 [opencl_profiling] spent 0,0553 seconds in [Write Image (from host to device)]
9,0525 [opencl_profiling] spent 0,0029 seconds in rawprepare_1f
9,0526 [opencl_profiling] spent 0,0141 seconds in whitebalance_1f
9,0526 [opencl_profiling] spent 0,0019 seconds in highlights_initmask
9,0527 [opencl_profiling] spent 0,0005 seconds in highlights_dilatemask
9,0527 [opencl_profiling] spent 0,3848 seconds in [Write Buffer (from host to device)]
9,0528 [opencl_profiling] spent 0,0024 seconds in highlights_chroma
9,0528 [opencl_profiling] spent 0,0004 seconds in [Read Buffer (from device to host)]
9,0528 [opencl_profiling] spent 0,0029 seconds in highlights_opposed
9,0529 [opencl_profiling] spent 0,1666 seconds in [Read Image (from device to host)]
9,0529 [opencl_profiling] spent 0,0005 seconds in border_interpolate
9,0530 [opencl_profiling] spent 0,0020 seconds in rcd_border_green
9,0530 [opencl_profiling] spent 0,0039 seconds in rcd_border_redblue
9,0531 [opencl_profiling] spent 0,0049 seconds in rcd_populate
9,0531 [opencl_profiling] spent 0,0032 seconds in rcd_step_1_1
9,0531 [opencl_profiling] spent 0,0031 seconds in rcd_step_1_2
9,0532 [opencl_profiling] spent 0,0015 seconds in rcd_step_2_1
9,0532 [opencl_profiling] spent 0,0038 seconds in rcd_step_3_1
9,0533 [opencl_profiling] spent 0,0021 seconds in rcd_step_4_1
9,0533 [opencl_profiling] spent 0,0016 seconds in rcd_step_4_2
9,0533 [opencl_profiling] spent 0,0039 seconds in rcd_step_5_1
9,0534 [opencl_profiling] spent 0,0066 seconds in rcd_step_5_2
9,0534 [opencl_profiling] spent 0,0083 seconds in rcd_write_output
9,0535 [opencl_profiling] spent 0,0176 seconds in denoiseprofile_precondition_Y0U0V0
9,0535 [opencl_profiling] spent 0,3049 seconds in denoiseprofile_decompose
9,0535 [opencl_profiling] spent 0,2378 seconds in denoiseprofile_reduce_first
9,0536 [opencl_profiling] spent 0,0001 seconds in denoiseprofile_reduce_second
9,0536 [opencl_profiling] spent 0,2835 seconds in denoiseprofile_synthesize
9,0537 [opencl_profiling] spent 0,0790 seconds in [Copy Image (on device)]
9,0537 [opencl_profiling] spent 0,0089 seconds in denoiseprofile_backtransform_Y0U0V0
9,0537 [opencl_profiling] spent 0,0125 seconds in lens_vignette
9,0538 [opencl_profiling] spent 0,0272 seconds in lens_distort_bicubic
9,0538 [opencl_profiling] spent 0,0091 seconds in ashift_bicubic
9,0539 [opencl_profiling] spent 0,0087 seconds in exposure
9,0539 [opencl_profiling] spent 0,0087 seconds in colorin_unbound
9,0540 [opencl_profiling] spent 0,0175 seconds in colorspaces_transform_lab_to_rgb_matrix
9,0540 [opencl_profiling] spent 0,0087 seconds in channelmixerrgb_CAT16
9,0540 [opencl_profiling] spent 0,0208 seconds in colorspaces_transform_rgb_matrix_to_lab
9,0541 [opencl_profiling] spent 0,3545 seconds in eaw_decompose
9,0542 [opencl_profiling] spent 0,3404 seconds in eaw_synthesize
9,0542 [opencl_profiling] spent 0,0084 seconds in colorbalancergb
9,0542 [opencl_profiling] spent 0,0084 seconds in rgblevels
9,0543 [opencl_profiling] spent 0,0087 seconds in sigmoid_loglogistic_per_channel
9,0543 [opencl_profiling] spent 0,0065 seconds in pad_input
9,0544 [opencl_profiling] spent 0,0350 seconds in gauss_reduce
9,0544 [opencl_profiling] spent 0,0304 seconds in process_curve
9,0545 [opencl_profiling] spent 0,2108 seconds in laplacian_assemble
9,0545 [opencl_profiling] spent 0,0095 seconds in write_back
9,0545 [opencl_profiling] spent 0,0091 seconds in colorout
9,0546 [opencl_profiling] spent 2,7437 seconds totally in command queue (with 0 events missing)
9,0546 [dev_process_export] pixel pipeline processing took 6,379 secs (2,188 CPU)

And here is all the information printed before the log (sorry, I don’t know how to create such nice collapsible code sections with markup like you did).

darktable 4.6.1
Copyright (C) 2012-2024 Johannes Hanika and other contributors.
Compile options:
Bit depth → 64 bit
Debug → DISABLED
SSE2 optimizations → ENABLED
OpenMP → ENABLED
OpenCL → ENABLED
Lua → ENABLED - API version 9.2.0
Colord → DISABLED
gPhoto2 → ENABLED
GMIC → ENABLED - Compressed LUTs are supported
GraphicsMagick → ENABLED
ImageMagick → DISABLED
libavif → ENABLED
libheif → ENABLED
libjxl → ENABLED
OpenJPEG → ENABLED
OpenEXR → ENABLED
WebP → ENABLED
See resources | darktable for detailed documentation.
See Sign in to GitHub · GitHub to report bugs.
(darktable-cli.exe:47544): Gtk-WARNING **: 21:58:17.086: gtk_disable_setlocale() must be called before gtk_init()
0,0881 [dt_get_sysresource_level] switched to 3 as `unrestricted'
0,0894 total mem: 16319MB
0,0901 mipmap cache: 2039MB
0,0908 available mem: 261116MB
0,0914 singlebuff: 16319MB
0.0950 [opencl_init] opencl library ‘OpenCL.dll’ found on your system and loaded, preference ‘default path’
0.1162 [opencl_init] found 1 platform
[opencl_init] found 1 device
[dt_opencl_device_init]
DEVICE: 0: ‘gfx1032’
PLATFORM, VENDOR & ID: AMD Accelerated Parallel Processing, Advanced Micro Devices, Inc., ID=4098
CANONICAL NAME: amdacceleratedparallelprocessinggfx1032
DRIVER VERSION: 3608.0 (PAL,LC)
DEVICE VERSION: OpenCL 2.0 AMD-APP (3608.0)
DEVICE_TYPE: GPU, dedicated mem
GLOBAL MEM SIZE: 8176 MB
MAX MEM ALLOC: 6732 MB
MAX IMAGE SIZE: 16384 x 16384
MAX WORK GROUP SIZE: 256
MAX WORK ITEM DIMENSIONS: 3
MAX WORK ITEM SIZES: [ 1024 1024 1024 ]
ASYNC PIXELPIPE: NO
PINNED MEMORY TRANSFER: NO
USE HEADROOM: 600Mb
AVOID ATOMICS: NO
MICRO NAP: 250
ROUNDUP WIDTH & HEIGHT 16x16
CHECK EVENT HANDLES: 128
TILING ADVANTAGE: 0.000
DEFAULT DEVICE: NO
KERNEL BUILD DIRECTORY: C:\Program Files\darktable\share\darktable\kernels
KERNEL DIRECTORY: C:\Users\sylwe\AppData\Local\Microsoft\Windows\INetCache\darktable\cached_v3_kernels_for_AMDAcceleratedParallelProcessinggfx1032_36080PALLC
CL COMPILER OPTION: -cl-fast-relaxed-math
CL COMPILER COMMAND: -w -cl-fast-relaxed-math -DAMD=1 -I"C:\Program Files\darktable\share\darktable\kernels"
KERNEL LOADING TIME: 0.0727 sec
[opencl_init] OpenCL successfully initialized. internal numbers and names of available devices:
[opencl_init] 0 ‘AMD Accelerated Parallel Processing gfx1032’
0.7908 [opencl_init] FINALLY: opencl is AVAILABLE and ENABLED.
[opencl_init] opencl_scheduling_profile: ‘very fast GPU’
[opencl_init] opencl_device_priority: ‘*/!0,*/*/*/!0,*’
[opencl_init] opencl_mandatory_timeout: 400
[dt_opencl_update_priorities] these are your device priorities:
[dt_opencl_update_priorities] image preview export thumbs preview2
[dt_opencl_update_priorities] 0 0 0 0 0
[dt_opencl_update_priorities] show if opencl use is mandatory for a given pixelpipe:
[dt_opencl_update_priorities] image preview export thumbs preview2
[dt_opencl_update_priorities] 1 1 1 1 1
[opencl_synchronization_timeout] synchronization timeout set to 0
[dt_opencl_update_priorities] these are your device priorities:
[dt_opencl_update_priorities] image preview export thumbs preview2
[dt_opencl_update_priorities] 0 0 0 0 0
[dt_opencl_update_priorities] show if opencl use is mandatory for a given pixelpipe:
[dt_opencl_update_priorities] image preview export thumbs preview2
[dt_opencl_update_priorities] 1 1 1 1 1
[opencl_synchronization_timeout] synchronization timeout set to 0

That’s what I also see, but I don’t think a faster CPU would solve that; those are all GPU-only timings.

For example, denoiseprofile was faster on my GPU, using tiling, than on yours, without it:

mine:  3.9984 [dev_pixelpipe] took 1.406 secs (0.866 CPU) [export] processed `denoiseprofile' on GPU with tiling, blended on CPU
yours: 4,9699 [dev_pixelpipe] took 1,749 secs (0,000 CPU) [export] processed `denoiseprofile' on GPU, blended on GPU

lens:

mine:  4.5705 [dev_pixelpipe] took 0.572 secs (1.564 CPU) [export] processed `lens' on GPU, blended on GPU
yours: 5,9107 [dev_pixelpipe] took 0,941 secs (2,172 CPU) [export] processed `lens' on GPU, blended on GPU

atrous:

mine:  5.9407 [dev_pixelpipe] took 1.213 secs (1.770 CPU) [export] processed `atrous' on GPU with tiling, blended on CPU
yours: 7,8978 [dev_pixelpipe] took 1,847 secs (0,000 CPU) [export] processed `atrous' on GPU, blended on GPU

Then a surprise: my bilat ran on the CPU, yours on the GPU:

mine:  7.6203 [dev_pixelpipe] took 1.449 secs (8.338 CPU) [export] processed `bilat' on CPU, blended on CPU
yours: 8,7702 [dev_pixelpipe] took 0,740 secs (0,000 CPU) [export] processed `bilat' on GPU, blended on GPU

Transfers and copies – a mixed bag: sometimes (like the first and fourth pairs) yours is roughly 6-10x faster; sometimes, like the 2nd pair, twice as slow:

mine:  7.8418 [opencl_profiling] spent 0.5348 seconds in [Write Image (from host to device)]
yours: 9,0525 [opencl_profiling] spent 0,0553 seconds in [Write Image (from host to device)]

mine:  7.8418 [opencl_profiling] spent 0.1928 seconds in [Write Buffer (from host to device)]
yours: 9,0527 [opencl_profiling] spent 0,3848 seconds in [Write Buffer (from host to device)]

mine:  7.8418 [opencl_profiling] spent 0.0000 seconds in [Read Buffer (from device to host)]
yours: 9,0528 [opencl_profiling] spent 0,0004 seconds in [Read Buffer (from device to host)]

mine:  7.8418 [opencl_profiling] spent 1.0297 seconds in [Read Image (from device to host)]
yours: 9,0529 [opencl_profiling] spent 0,1666 seconds in [Read Image (from device to host)]

mine:  7.8419 [opencl_profiling] spent 0.0659 seconds in [Copy Image (on device)]
yours: 9,0537 [opencl_profiling] spent 0,0790 seconds in [Copy Image (on device)]
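
The ratios between these pairs can be recomputed with a few lines of Python. The timings are copied from the excerpts above; the dictionary and function names are only illustrative.

```python
# Transfer timings in seconds from the [opencl_profiling] excerpts above,
# as (GTX 1060, W6600). Read Buffer is left out: both values are close to
# zero, so a ratio would not mean much.
transfers = {
    "Write Image": (0.5348, 0.0553),
    "Write Buffer": (0.1928, 0.3848),
    "Read Image": (1.0297, 0.1666),
    "Copy Image": (0.0659, 0.0790),
}

def speed_ratio(gtx_seconds, w6600_seconds):
    """How many times faster the W6600 was; below 1 means it was slower."""
    return gtx_seconds / w6600_seconds

for name, (gtx, w6600) in transfers.items():
    print(f"{name}: {speed_ratio(gtx, w6600):.2f}x")
```

A value above 1 means the W6600 finished that transfer faster, a value below 1 means the GTX 1060 did.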

OK, I did some tests on my son’s recently built PC with a Ryzen 5 7600 and an RX 7600.
CPU only: 11.2 sec
CPU+GPU: 5.1 sec
It is a slightly faster card than my W6600, but this result is still far from the 2.8 seconds achieved with a slower CPU (Ryzen 7 2700x) and the same GPU (RX 7600) on the list I mentioned:
GPU benchmarks in darktable (dartmouth.edu)
I ran some OpenCL benchmarks using Compubench, and it turned out that in most operations my card is about 2x as fast as my previous RX 570. In Darktable the difference is maybe 15%.
I also tried some older versions of the AMD drivers, with little change. So to me it seems that the Windows version of DT is for some reason slower with OpenCL than the Linux version. Tests using only the CPU are on par with similar hardware running Linux. It could also be a problem with Windows 11, which I have on both my PC and my son’s, but the OpenCL tests I mentioned do not show this. When I have some free time I’ll try to install Ubuntu on another disk in my PC and run the same benchmarks.
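
To put the numbers from this post side by side, a short Python calculation (the variable names are just for illustration; the three timings are the ones quoted above):

```python
def speedup(baseline_seconds, seconds):
    """Ratio of two run times; 2.0 means the second run took half as long."""
    return baseline_seconds / seconds

cpu_only = 11.2   # Ryzen 5 7600, CPU only
cpu_gpu = 5.1     # Ryzen 5 7600 + RX 7600
reference = 2.8   # RX 7600 entry on the Dartmouth list

gpu_gain = speedup(cpu_only, cpu_gpu)
gap_to_reference = speedup(cpu_gpu, reference)

print(f"GPU vs CPU only: {gpu_gain:.2f}x")
print(f"Listed result vs this run: {gap_to_reference:.2f}x")
```

The first ratio is what the GPU adds on this machine; the second is how far this run is from the listed result for the same card.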

It’s probably a driver issue, then. In darktable, the code is written in OpenCL, which is device-independent. When you launch darktable for the first time, or after upgrading the graphics driver, it asks the driver to build the driver-specific ‘kernel’ from the OpenCL source.
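
The kernel directory name in the log above hints at why a driver upgrade triggers a rebuild: the driver version appears to be part of the folder name. A Python sketch that reproduces that name, assuming it is simply the device string and the driver version with all non-alphanumeric characters removed (this is inferred from the log, not taken from darktable's source, and the function name is made up):

```python
import re

def kernel_cache_dir_name(device_name, driver_version, cache_version="v3"):
    """Illustrative only: rebuild the cache folder name seen in the log.

    Assumption: the name is the platform/device string plus the driver
    version, each stripped of every non-alphanumeric character.
    """
    def strip(text):
        return re.sub(r"[^A-Za-z0-9]", "", text)

    return (
        f"cached_{cache_version}_kernels_for_"
        f"{strip(device_name)}_{strip(driver_version)}"
    )

name = kernel_cache_dir_name(
    "AMD Accelerated Parallel Processing gfx1032", "3608.0 (PAL,LC)"
)
print(name)
```

Under that assumption, a different driver version gives a different folder name, so the previously compiled kernels are not found and get built again.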

So maybe the Windows OpenCL driver is off, then? But as I mentioned, I tried a clean uninstall followed by installing a few (much) older AMD drivers, with little to no impact. On the other hand, the 6 seconds I get now versus 16 seconds on my CPU alone is still a huge improvement. I’m curious, however, how it would be with a Ryzen 9 7900, which achieves results in these Darktable benchmarks similar to my GPU’s. If the OpenCL results were the same as CPU-only, then what? :slight_smile: